Chapter 8. Replication

 

In this chapter

  • Basic replication concepts
  • Administering replica sets and handling failover
  • Replica set connections, write concern, read scaling, and tagging

 

Replication is central to most database management systems because of one inevitable fact: failures happen. If you want your live production data to be available even after a failure, you need to be sure that your production databases are available on more than one machine. Replication insures against failure, providing high availability and disaster recovery.

I begin this chapter by introducing replication in general and discussing its main use cases. I’ll then cover MongoDB’s replication through a detailed study of replica sets. Finally, I’ll describe how to connect to replicated MongoDB clusters using the drivers, how to use write concern, and how to load balance reads across replicas.

8.1. Replication overview

Replication is the distribution and maintenance of a live database server across multiple machines. MongoDB provides two flavors of replication: master-slave replication and replica sets. For both, a single primary node receives all writes, and then all secondary nodes read and apply those writes to themselves asynchronously.

Master-slave replication and replica sets use the same replication mechanism, but replica sets additionally ensure automated failover: if the primary node goes offline for any reason, then one of the secondary nodes will automatically be promoted to primary, if possible. Replica sets provide other enhancements too, such as easier recovery and more sophisticated deployment topologies. For these reasons, there are now few compelling reasons to use simple master-slave replication.[1] Replica sets are thus the recommended replication strategy for production deployments; consequently, I’ll devote the bulk of this chapter to explanations and examples of replica sets, with only a brief overview of master-slave replication.

1 The only time you should opt for MongoDB’s master-slave replication is when you’d require more than 11 slave nodes, since a replica set can have no more than 12 members.

8.1.1. Why replication matters

All databases are vulnerable to failures of the environments in which they run. Replication provides a kind of insurance against these failures. What sort of failure am I talking about? Here are some of the more common scenarios:

  • The network connection between the application and the database is lost.
  • Planned downtime prevents the server from coming back online as expected. Any institution housing servers will be forced to schedule occasional downtime, and the results of this downtime aren’t always easy to predict. A simple reboot will keep a database server offline for at least a few minutes. But then there’s the question of what happens when the reboot is complete. There are times when newly installed software or hardware will prevent the operating system from starting up properly.
  • There’s a loss of power. Although most modern data centers feature redundant power supplies, nothing prevents user error within the data center itself or an extended brownout or blackout from shutting down your database server.
  • A hard drive fails on the database server. Frequently having a mean time to failure of just a few years, hard drives fail more often than you might think.[2]

    2 You can read a detailed analysis of consumer hard drive failure rates in Google’s “Failure Trends in a Large Disk Drive Population” (http://research.google.com/archive/disk_failures.pdf).

In addition to protecting against external failures, replication has been important for MongoDB in particular for durability. When running without journaling enabled, MongoDB’s data files aren’t guaranteed to be free of corruption in the event of an unclean shutdown. Without journaling, replication must always be run to guarantee a clean copy of the data files if a single node shuts down hard.

Of course, replication is desirable even when running with journaling. After all, you still want high availability and fast failover. In this case, journaling expedites recovery because it allows you to bring failed nodes back online simply by replaying the journal. This is much faster than resyncing from an existing replica or copying a replica’s data files manually.

Journaled or not, MongoDB’s replication greatly increases the reliability of the overall database deployment and is highly recommended.

8.1.2. Replication use cases

You may be surprised at how versatile a replicated database can be. In particular, replication facilitates redundancy, failover, maintenance, and load balancing. Here we take a brief look at each of these use cases.

Replication is designed primarily for redundancy. It essentially ensures that replicated nodes stay in sync with the primary node. These replicas can live in the same data center as the primary, or they can be distributed geographically as an additional failsafe. Because replication is asynchronous, any sort of network latency or partition between nodes will have no effect on the performance of the primary. As another form of redundancy, replicated nodes can also be delayed by a constant number of seconds behind the primary. This provides insurance against the case where a user inadvertently drops a collection or an application somehow corrupts the database. Normally, these operations will be replicated immediately; a delayed replica gives administrators time to react and possibly save their data.

It’s important to note that although they’re redundant, replicas aren’t a replacement for backups. A backup represents a snapshot of the database at a particular time in the past, whereas a replica is always up to date. There are cases where a data set is large enough to render backups impractical, but as a general rule, backups are prudent and recommended even when running with replication.

Another use case for replication is failover. You want your systems to be highly available, but this is possible only with redundant nodes and the ability to switch over to those nodes in an emergency. Conveniently, MongoDB’s replica sets can frequently make this switch automatically.

In addition to providing redundancy and failover, replication simplifies maintenance, usually by allowing you to run expensive operations on a node other than the primary. For example, it’s common practice to run backups against a secondary node to keep unnecessary load off the primary and to avoid downtime. Another example involves building large indexes. Because index builds are expensive, you may opt to build on a secondary node first, swap the secondary with the existing primary, and then build again on the new secondary.

Finally, replication allows you to balance reads across replicas. For applications whose workloads are overwhelmingly read-heavy, this is the easiest way to scale MongoDB. But for all its promise, scaling reads with secondaries isn’t practical if any of the following apply:

  • The allotted hardware can’t process the given workload. As an example, I mentioned working sets in the previous chapter. If your working data set is much larger than the available RAM, then sending random reads to the secondaries is still likely to result in excessive disk access, and thus slow queries.
  • The ratio of writes to reads exceeds 50%. This is an admittedly arbitrary ratio, but it’s a reasonable place to start. The issue here is that every write to the primary must eventually be written to all the secondaries as well. Therefore directing reads to secondaries that are already processing a lot of writes can sometimes slow the replication process and may not result in increased read throughput.
  • The application requires consistent reads. Secondary nodes replicate asynchronously and therefore aren’t guaranteed to reflect the latest writes to the primary node. In pathological cases, secondaries can run hours behind.

So you can balance read load with replication, but only in special cases. If you need to scale and any of the preceding conditions apply, then you’ll need a different strategy, involving sharding, augmented hardware, or some combination of the two.

8.2. Replica sets

Replica sets are a refinement on master-slave replication, and they’re the recommended MongoDB replication strategy. We’ll start by configuring a sample replica set. I’ll then describe how replication actually works, as this knowledge is incredibly important for diagnosing production issues. We’ll end by discussing advanced configuration details, failover and recovery, and best deployment practices.

8.2.1. Setup

The minimum recommended replica set configuration consists of three nodes. Two of these nodes serve as first-class, persistent mongod instances. Either can act as the replica set primary, and both have a full copy of the data. The third node in the set is an arbiter, which doesn’t replicate data, but merely acts as a kind of neutral observer. As the name suggests, the arbiter arbitrates: when failover is required, the arbiter helps to elect a new primary node. You can see an illustration of the replica set you’re about to set up in figure 8.1.

Figure 8.1. A basic replica set consisting of a primary, a secondary, and an arbiter

Start by creating a data directory for each replica set member:

mkdir /data/node1
mkdir /data/node2
mkdir /data/arbiter

Next, start each member as a separate mongod. Since you’ll be running each process on the same machine, it’s probably easiest to start each mongod in a separate terminal window:

mongod --replSet myapp --dbpath /data/node1 --port 40000
mongod --replSet myapp --dbpath /data/node2 --port 40001
mongod --replSet myapp --dbpath /data/arbiter --port 40002

If you examine the mongod log output, the first thing you’ll notice are error messages saying that the configuration can’t be found. This is completely normal:

[startReplSets] replSet can't get local.system.replset
  config from self or any seed (EMPTYCONFIG)
[startReplSets] replSet info you may need to run replSetInitiate

To proceed, you need to configure the replica set. Do so by first connecting to one of the non-arbiter mongods just started. These examples were produced with the mongod processes running locally, so you’ll connect via the local hostname, in this case arete.

Connect, and then run the rs.initiate() command:

> rs.initiate()
{
    "info2" : "no configuration explicitly specified -- making one",
    "me" : "arete:40000",
    "info" : "Config now saved locally. Should come online in about a minute
     .",
    "ok" : 1
}

Within a minute or so, you’ll have a one-member replica set. You can now add the other two members using rs.add():

> rs.add("localhost:40001")
{ "ok" : 1 }
> rs.add("arete.local:40002", {arbiterOnly: true})
{ "ok" : 1 }

Note that for the second node, you specify the arbiterOnly option to create an arbiter. Within a minute, all members should be online. To get a brief summary of the replica set status, run the db.isMaster() command:

> db.isMaster()
{
  "setName" : "myapp",
  "ismaster" : false,
  "secondary" : true,
  "hosts" : [
    "arete:40001",
    "arete:40000"
  ],
  "arbiters" : [
    "arete:40002"
  ],
  "primary" : "arete:40000",
  "maxBsonObjectSize" : 16777216,
  "ok" : 1
}

A more detailed view of the system is provided by the rs.status() method. You’ll see state information for each node. Here’s the complete status listing:

> rs.status()
{
    "set" : "myall",
    "date" : ISODate("2011-09-27T22:09:04Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "arete:40000",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : {
                "t" : 1317161329000,
                "i" : 1
            },
            "optimeDate" : ISODate("2011-09-27T22:08:49Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "arete:40001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 59,
            "optime" : {
                "t" : 1317161329000,
                "i" : 1
            },
            "optimeDate" : ISODate("2011-09-27T22:08:49Z"),
            "lastHeartbeat" : ISODate("2011-09-27T22:09:03Z"),
            "pingMs" : 0
        },
        {
            "_id" : 2,
            "name" : "arete:40002",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 5,
            "optime" : {
                "t" : 0,
                "i" : 0
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2011-09-27T22:09:03Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}

Unless your MongoDB database contains a lot of data, the replica set should come online within 30 seconds. During this time, the stateStr field of each node should transition from RECOVERING to PRIMARY, SECONDARY, or ARBITER.
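
If you’d like to watch these transitions as they happen, a small loop in the shell (a convenience sketch, not part of the original session) prints each member’s state once per second; interrupt it with CTRL-C:

while (true) {
  rs.status().members.forEach(function(m) { print(m.name + ": " + m.stateStr) });
  print("---");
  sleep(1000);
}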

Now even if the replica set status claims that replication is working, you may want to see some empirical evidence of this. So go ahead and connect to the primary node with the shell and insert a document:

$ mongo arete:40000
> use bookstore
switched to db bookstore
> db.books.insert({title: "Oliver Twist"})
> show dbs
admin (empty)
bookstore 0.203125GB
local 0.203125GB

Initial replication should occur almost immediately. In another terminal window, open a new shell instance, but this time point it to the secondary node. Query for the document just inserted; it should have arrived:

$ mongo arete:40001
> show dbs
admin (empty)
bookstore 0.203125GB
local 0.203125GB
> use bookstore
switched to db bookstore
> db.books.find()
{ "_id" : ObjectId("4d42ebf28e3c0c32c06bdf20"), "title" : "Oliver Twist" }

If replication is indeed working as displayed here, then you’ve successfully configured your replica set.

It should be satisfying to see replication in action, but perhaps more interesting is automated failover. Let’s test that now. It’s always tricky to simulate a network partition, so we’ll go the easy route and just kill a node. You could kill the secondary, but that merely stops replication, with the remaining nodes maintaining their current status. If you want to see a change of system state, you need to kill the primary. A standard CTRL-C or kill -2 will do the trick. You can also connect to the primary using the shell and run db.shutdownServer().

Once you’ve killed the primary, note that the secondary detects the lapse in the primary’s heartbeat. The secondary then elects itself primary. This election is possible because a majority of the original nodes (the arbiter and the original secondary) are still able to ping each other. Here’s an excerpt from the secondary node’s log:

[ReplSetHealthPollTask] replSet info arete:40000 is down (or slow to respond)
Mon Jan 31 22:56:22 [rs Manager] replSet info electSelf 1
Mon Jan 31 22:56:22 [rs Manager] replSet PRIMARY

If you connect to the new primary node and check the replica set status, you’ll see that the old primary is unreachable:

> rs.status()
{
      "_id" : 0,
      "name" : "arete:40000",
      "health" : 1,
      "state" : 6,
      "stateStr" : "(not reachable/healthy)",
      "uptime" : 0,
      "optime" : {
        "t" : 1296510078000,
        "i" : 1
      },
      "optimeDate" : ISODate("2011-01-31T21:43:18Z"),
      "lastHeartbeat" : ISODate("2011-02-01T03:29:30Z"),
      "errmsg": "socket exception"
}

Post-failover, the replica set consists of just two nodes. Because the arbiter has no data, your application will continue to function as long as it communicates with the primary node only.[3] Even so, replication isn’t happening, and there’s now no possibility of failover. The old primary must be restored. Assuming that the old primary was shut down cleanly, you can bring it back online, and it’ll automatically rejoin the replica set as a secondary. Go ahead and try that now by restarting the old primary node.

3 Applications sometimes query secondary nodes for read scaling. If that’s happening, then this kind of failure will cause read failures. Thus it’s important to design your application with failover in mind. More on this at the end of the chapter.

That’s a clean overview of replica sets. Some of the details are, unsurprisingly, messier. In the next two sections, you’ll see how replica sets actually work, and look at deployment, advanced configuration, and how to handle tricky scenarios that may arise in production.

8.2.2. How replication works

Replica sets rely on two basic mechanisms: an oplog and a heartbeat. The oplog enables the replication of data, and the heartbeat monitors health and triggers failover. You’ll now see how each of these mechanisms works; in the process, you should begin to understand and predict replica set behavior, particularly in failure scenarios.

All about the oplog

At the heart of MongoDB’s replication stands the oplog. The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.[4]

4 The BSON timestamp is a unique identifier comprising the number of seconds since the epoch and an incrementing counter.

To better see how this works, let’s look more closely at a real oplog and at the operations recorded therein. First connect with the shell to the primary node started in the previous section, and switch to the local database:

> use local
switched to db local

The local database stores all the replica set metadata and the oplog. Naturally, this database isn’t replicated itself. Thus it lives up to its name; data in the local database is supposed to be unique to the local node and therefore shouldn’t be replicated.

If you examine the local database, you’ll see a collection called oplog.rs, which is where every replica set stores its oplog. You’ll also see a few system collections. Here’s the complete output:

> show collections
me
oplog.rs
replset.minvalid
slaves
system.indexes
system.replset

replset.minvalid contains information for the initial sync of a given replica set member, and system.replset stores the replica set config document. me and slaves are used to implement write concern, described at the end of this chapter, and system.indexes is the standard index spec container.

First we’ll focus on the oplog. Let’s query for the oplog entry corresponding to the book document you added in the previous section. To do so, enter the following query. The resulting document will have four fields, and we’ll discuss each in turn:

> db.oplog.rs.findOne({op: "i"})
{ "ts" : { "t" : 1296864947000, "i" : 1 }, "op" : "i", "ns" :
  "bookstores.books", "o" : { "_id" : ObjectId("4d4c96b1ec5855af3675d7a1"),
  "title" : "Oliver Twist" }
}

The first field, ts, stores the entry’s BSON timestamp. Pay particular attention here; the shell displays the timestamp as a subdocument with two fields, t for the seconds since epoch and i for the counter. This might lead you to believe that you could query for the entry like so:

db.oplog.rs.findOne({ts: {t: 1296864947000, i: 1}})

In fact, this query returns null. To query with a timestamp, you need to explicitly construct a timestamp object. All the drivers have their own BSON timestamp constructors, and so does JavaScript. Here’s how to use it:

db.oplog.rs.findOne({ts: new Timestamp(1296864947000, 1)})

Returning to the oplog entry, the second field, op, specifies the opcode. This tells the secondary node which operation the oplog entry represents. Here you see an i, indicating an insert. After op comes ns to signify the relevant namespace (database and collection) and o, which for insert operations contains a copy of the inserted document.

As you examine oplog entries, you may notice that operations affecting multiple documents are broken up into their component parts. For multi-updates and mass deletes, a separate entry is created in the oplog for each document affected. For example, suppose you add a few more Dickens books to the collection:

> use bookstore
> db.books.insert({title: "A Tale of Two Cities"})
> db.books.insert({title: "Great Expectations"})

Now with three books in the collection, let’s issue a multi-update to set the author’s name:

db.books.update({}, {$set: {author: "Dickens"}}, false, true)

How does this appear in the oplog?

> use local
> db.oplog.rs.find({op: "u"})
{ "ts" :{ "t" : 1296944149000, "i" : 1 }, "op" :  "u",
"ns" : "bookstore.books",
"o2" : {"_id" : ObjectId("4d4dcb89ec5855af365d4283")  },
"o" : {"$set" : { "author" : "Dickens" } }  }

{ "ts" :{ "t" : 1296944149000, "i" : 2 }, "op" :  "u",
"ns" : "bookstore.books",
"o2" : {"_id" : ObjectId("4d4dcb8eec5855af365d4284")  },
"o" : {"$set" : { "author" : "Dickens" } }  }

{ "ts" :{ "t" : 1296944149000, "i" : 3 }, "op" :  "u",
"ns" : "bookstore.books",
"o2" : {"_id" : ObjectId("4d4dcbb6ec5855af365d4285")  },
"o" : {"$set" : { "author" : "Dickens" } }  }

As you can see, each updated document gets its own oplog entry. This normalization is done as part of the more general strategy of ensuring that secondaries always end up with the same data as the primary. To guarantee this, every applied operation must be idempotent—it can’t matter how many times a given oplog entry is applied; the result must always be the same. Other multidocument operations, like deletes, will exhibit the same behavior. You’re encouraged to try different operations and see how they ultimately appear in the oplog.
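
You can convince yourself of this idempotency by hand (a quick illustration, not an official replay mechanism). Reapply one of the $set operations recorded above; the second application changes nothing:

> use bookstore
> var id = db.books.findOne({title: "Oliver Twist"})._id
> db.books.update({_id: id}, {$set: {author: "Dickens"}})
> db.books.update({_id: id}, {$set: {author: "Dickens"}})

However many times the update runs, the document ends in the same state, which is exactly the property the oplog requires.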

To get some basic information about the oplog’s current status, you can run the shell’s db.getReplicationInfo() method:

> db.getReplicationInfo()
{
    "logSizeMB" : 50074.10546875,
    "usedMB" : 302.123,
    "timeDiff" : 294,
    "timeDiffHours" : 0.08,
    "tFirst" : "Thu Jun 16 2011 21:21:55 GMT-0400 (EDT)",
    "tLast" : "Thu Jun 16 2011 21:26:49 GMT-0400 (EDT)",
    "now" : "Thu Jun 16 2011 21:27:28 GMT-0400 (EDT)"
}

Here you see the timestamps of the first and last entries in this oplog. You can find these oplog entries manually by using the $natural sort modifier. For example, the following query fetches the latest entry:

db.oplog.rs.find().sort({$natural: -1}).limit(1)
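
Reversing the sort direction fetches the oldest entry instead:

db.oplog.rs.find().sort({$natural: 1}).limit(1)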

The only important thing left to understand about replication is how the secondaries keep track of their place in the oplog. The answer lies in the fact that secondaries also keep an oplog. This is a significant improvement upon master-slave replication, so it’s worth taking a moment to explore the rationale.

Imagine you issue a write to the primary node of a replica set. What happens next? First, the write is recorded and then added to the primary’s oplog. Meanwhile, all secondaries have their own oplogs that replicate the primary’s oplog. So when a given secondary node is ready to update itself, it does three things. First, it looks at the timestamp of the latest entry in its own oplog. Next, it queries the primary’s oplog for all entries greater than that timestamp. Finally, it adds each of those entries to its own oplog and applies the entries to itself.[5] This means that, in case of failover, any secondary promoted to primary will have an oplog that the other secondaries can replicate from. This feature essentially enables replica set recovery.

5 When journaling is enabled, documents are written to the core data files and to the oplog simultaneously in an atomic transaction.

Secondary nodes use long polling to immediately apply new entries from the primary’s oplog. Thus secondaries will usually be almost completely up to date. When they do fall behind, because of network partitions or maintenance on secondaries themselves, the latest timestamp in each secondary’s oplog can be used to monitor any replication lag.
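
For example, you can compute a rough lag figure in the shell by comparing member optimes from rs.status() (a monitoring sketch; it assumes the set currently has a primary):

var status = rs.status()
var primary = status.members.filter(function(m) { return m.stateStr == "PRIMARY" })[0]
status.members.forEach(function(m) {
  if (m.stateStr == "SECONDARY")
    print(m.name + " lags by about " + (primary.optimeDate - m.optimeDate) / 1000 + " seconds")
})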

Halted replication

Replication will halt permanently if a secondary can’t find the point it’s synced to in the primary’s oplog. When that happens, you’ll see an exception in the secondary’s log that looks like this:

repl: replication data too stale, halting
Fri Jan 28 14:19:27 [replsecondary] caught SyncException

Recall that the oplog is a capped collection. This means that entries in the collection eventually age out. Once a given secondary fails to find the point at which it’s synced in the primary’s oplog, there’s no longer any way of ensuring that the secondary is a perfect replica of the primary. Because the only remedy for halted replication is a complete resync of the primary’s data, you’ll want to strive to avoid this state. To do that, you’ll need to monitor secondary delay, and you’ll need to have a large enough oplog for your write volume. You’ll learn more about monitoring in chapter 10. Choosing the right oplog size is what we’ll cover next.

Sizing the replication oplog

The oplog is a capped collection and as such, it can’t be resized once it’s been created (at least, as of MongoDB v2.0).[6] This makes it important to choose an initial oplog size carefully.

6 The option to increase the size of a capped collection is a planned feature. See https://jira.mongodb.org/browse/SERVER-1864.

The default oplog sizes vary somewhat. On 32-bit systems, the oplog will default to 50 MB, whereas on 64-bit systems, the oplog will be the larger of 1 GB or 5% of free disk space.[7] For many deployments, 5% of free disk space will be more than enough. One way to think about an oplog of this size is to recognize that once it overwrites itself 20 times, the disk will likely be full.

7 Unless you’re running on OS X, in which case the oplog will be 192 MB. This smaller size is due to the assumption that OSX machines are development machines.

That said, the default size won’t be ideal for all applications. If you know that your application will have a high write volume, you should do some empirical testing before deploying. Set up replication and then write to the primary at the rate you’ll have in production. You’ll want to hammer the server in this way for at least an hour. Once done, connect to any replica set member and get the current replication info:

db.getReplicationInfo()

Once you know how much oplog you’re generating per hour, you can then decide how much oplog space to allocate. You should probably shoot for being able to withstand at least eight hours of secondary downtime. You want to avoid having to completely resync any node, and increasing the oplog size will buy you time in the event of network failures and the like.
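
To make the arithmetic concrete, suppose (hypothetically) that after an hour of test writes, db.getReplicationInfo() reports a usedMB of 600. You’re generating roughly 600 MB of oplog per hour, so withstanding eight hours of secondary downtime requires 600 MB × 8 = 4,800 MB; you’d provision an oplog of about 5 GB.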

If you want to change the default oplog size, you must do so the first time you start each member node using mongod’s --oplogSize option. The value is in megabytes. Thus you can start mongod with a 1 GB oplog like so:

mongod --replSet myapp --oplogSize 1024

Heartbeat and failover

The replica set heartbeat facilitates election and failover. By default, each replica set member pings all the other members every two seconds. In this way, the system can ascertain its own health. When you run rs.status(), you see the timestamp of each node’s last heartbeat along with its state of health (1 means healthy and 0 means unresponsive).

As long as every node remains healthy and responsive, the replica set will hum along its merry way. But if any node becomes unresponsive, action may be taken. What every replica set wants is to ensure that exactly one primary node exists at all times. But this is possible only when a majority of nodes is visible. For example, look back at the replica set you built in the previous section. If you kill the secondary, then a majority of nodes still exists, so the replica set doesn’t change state but simply waits for the secondary to come back online. If you kill the primary, then a majority still exists, but there’s no primary. Therefore, the secondary is automatically promoted to primary. If more than one secondary happens to exist, then the most current secondary will be the one elected.

But there are other possible scenarios. Imagine that both the secondary and the arbiter are killed. Now the primary remains, but there’s no majority—only one of the three original nodes remains healthy. In this case, you’ll see a message like this in the primary’s log:

Tue Feb 1 11:26:38 [rs Manager] replSet can't see a majority of the set,
    relinquishing primary
Tue Feb 1 11:26:38 [rs Manager] replSet relinquishing primary state
Tue Feb 1 11:26:38 [rs Manager] replSet SECONDARY

With no majority, the primary actually demotes itself to a secondary. This may seem puzzling, but think about what might happen if this node were allowed to remain primary. If the heartbeats fail due to some kind of network partition, then the other nodes will still be online. If the arbiter and secondary are still up and able to see each other, then according to the rule of the majority, the remaining secondary will become a primary. If the original primary doesn’t step down, then you’re suddenly in an untenable situation: a replica set with two primary nodes. If the application continues to run, then it might write to and read from two different primaries, a sure recipe for inconsistency and truly bizarre application behavior. Therefore, when the primary can’t see a majority, it must step down.

Commit and rollback

One final important point to understand about replica sets is the concept of a commit. In essence, you can write to a primary node all day long, but those writes won’t be considered committed until they’ve been replicated to a majority of nodes. What do I mean by committed here? The idea can best be explained by example. Imagine again the replica set you built in the previous section. Suppose you issue a series of writes to the primary that don’t get replicated to the secondary for some reason (connectivity issues, secondary is down for backup, secondary is lagging, and so forth). Now suppose further that the secondary is suddenly promoted to primary. You write to the new primary, and eventually the old primary comes back online and tries to replicate from the new primary. The problem here is that the old primary has a series of writes that don’t exist in the new primary’s oplog. This situation triggers a rollback.

In a rollback, all writes that were never replicated to a majority are undone. This means that they’re removed from both the secondary’s oplog and the collection where they reside. If a secondary has registered a delete, then the node will look for the deleted document in another replica and restore it to itself. The same is true for dropped collections and updated documents.

The reverted writes are stored in the rollback subdirectory of the relevant node’s data path. For each collection with rolled-back writes, a separate BSON file will be created whose filename includes the time of the rollback. In the event that you need to restore the reverted documents, you can examine these BSON files using the bsondump utility and manually restore them, possibly using mongorestore.

If you ever find yourself having to restore rolled-back data, you’ll realize that this is a situation you want to avoid, and fortunately you can to some extent. If your application can tolerate the extra write latency, you can use write concern, described later, to ensure that your data is replicated to a majority of nodes on each write (or perhaps after every several writes). Being smart about write concern and about monitoring of replication lag in general will help you mitigate or even avoid the problem of rollback altogether.

In this section you learned perhaps a few more replication internals than expected, but the knowledge should come in handy. Understanding how replication works goes a long way in helping you to diagnose any issues you may have in production.

8.2.3. Administration

For all the automation they provide, replica sets have some potentially complicated configuration options. In what follows, I’ll describe these options in detail. In the interest of keeping things simple, I’ll also suggest which options can be safely ignored.

Configuration details

Here I present the mongod startup options pertaining to replica sets, and I describe the structure of the replica set configuration document.

Replication options

Earlier, you learned how to initiate a replica set using the shell’s rs.initiate() and rs.add() methods. These methods are convenient, but they hide certain replica set configuration options. Here you’ll see how to use a configuration document to initiate and update a replica set’s configuration.

A configuration document specifies the configuration of the replica set. To create one, first define a document whose _id matches the name you passed to the --replSet parameter:

> config = {_id: "myapp", members: []}
{ "_id" : "myapp", "members" : [ ] }

The individual members can be defined as part of the configuration document as follows:

config.members.push({_id: 0, host: 'arete:40000'})
config.members.push({_id: 1, host: 'arete:40001'})
config.members.push({_id: 2, host: 'arete:40002', arbiterOnly: true})

Your configuration document should now look like this:

> config
{
  "_id" : "myapp",
  "members" : [
    {
      "_id" : 0,
      "host" : "arete:40000"
    },
    {
      "_id" : 1,
      "host" : "arete:40001"
    },
    {
      "_id" : 2,
      "host" : "arete:40002",
      "arbiterOnly" : true
    }
  ]
}

You can then pass the document as the first argument to rs.initiate() to initiate the replica set.

Technically, the document consists of an _id containing the name of the replica set, an array specifying as many as 12 members, and an optional subdocument for specifying certain global settings. This sample replica set uses the minimum required configuration parameters, plus the optional arbiterOnly setting.

The document requires an _id that matches the replica set’s name. The initiation command will verify that each member node has been started with the --replSet option with that name. Each replica set member requires an _id, an integer that starts at 0 and increases by one for each member added. Members also require a host field containing a host name and an optional port.

Here you initiate the replica set using the rs.initiate() method. This is a simple wrapper for the replSetInitiate command. Thus you could’ve started the replica set like so:

db.runCommand({replSetInitiate: config});

config is simply a variable holding your configuration document. Once initiated, each set member stores a copy of this configuration document in the local database’s system.replset collection. If you query the collection, you’ll see that the document now has a version number. Whenever you modify the replica set’s configuration, you must also increment this version number.

To modify a replica set’s configuration, there’s a separate command, replSetReconfig, which takes a new configuration document. The new document can specify the addition or removal of set members along with alterations to both member-specific and global configuration options. The process of modifying a configuration document, incrementing the version number, and passing it as part of the replSetReconfig can be laborious, so there exist a number of shell helpers to ease the way. To see a list of them all, enter rs.help() at the shell. You’ve already seen rs.add().

Bear in mind that whenever a replica set reconfiguration results in the election of a new primary node, all client connections will be closed. This is done to ensure that clients won’t attempt to send fire-and-forget writes to a secondary node.

If you’re interested in configuring a replica set from one of the drivers, you can see how by examining the implementation of rs.add(). Enter rs.add (the method without the parentheses) at the shell prompt to see how the method works.

Configuration document options

Until now, we’ve limited ourselves to the simplest replica set configuration document. But these documents support several options for both replica set members and for the replica set as a whole. We’ll begin with the member options. You’ve seen _id, host, and arbiterOnly. Here are these plus the rest, in all their gritty detail:

  • _id (required)—A unique incrementing integer representing the member’s ID. These _id values begin at 0 and must be incremented by one for each member added.
  • host (required)—A string storing the host name of this member along with an optional port number. If the port is provided, it should be separated from the host name by a colon (for example, arete:30000). If no port number is specified, the default port, 27017, will be used.
  • arbiterOnly—A Boolean value, true or false, indicating whether this member is an arbiter. Arbiters store configuration data only. They’re lightweight members that participate in primary election but not in the replication itself.
  • priority—An integer from 0 to 1000 that helps to determine the likelihood that this node will be elected primary. For both replica set initiation and failover, the set will attempt to elect as primary the node with the highest priority, as long as it’s up to date. There are also cases where you might want a node never to be primary (say, a disaster recovery node residing in a secondary data center). In those cases, set the priority to 0. Nodes with a priority of 0 will be marked as passive in the results to the isMaster() command and will never be elected primary.
  • votes—All replica set members get one vote by default. The votes setting allows you to give more than one vote to an individual member. This option should be used with extreme care, if at all. For one thing, it’s difficult to reason about replica set failover behavior when not all members have the same number of votes. Moreover, the vast majority of production deployments will be perfectly well served with one vote per member. So if you do choose to alter the number of votes for a given member, be sure to think through and simulate the various failure scenarios very carefully.
  • hidden—A Boolean value that, when true, will keep this member from showing up in the responses generated by the isMaster command. Because the MongoDB drivers rely on isMaster for knowledge of the replica set topology, hiding a member keeps the drivers from automatically accessing it. This setting can be used in conjunction with buildIndexes and must be used with slaveDelay.
  • buildIndexes—A Boolean value, defaulting to true, that determines whether this member will build indexes. You’ll want to set this value to false only on members that will never become primary (those with a priority of 0). This option was designed for nodes used solely as backups. If backing up indexes is important, then you shouldn’t use this option.
  • slaveDelay—The number of seconds that a given secondary should lag behind the primary. This option can be used only with nodes that will never become primary. So to specify a slaveDelay greater than 0, be sure to also set a priority of 0. You can use a delayed slave as insurance against certain kinds of user errors. For example, if you have a secondary delayed by 30 minutes and an administrator accidentally drops a database, then you have 30 minutes to react to this event before it’s propagated. A sample member configuration using these options appears just after this list.
  • tags—A document containing an arbitrary set of key-value pairs, usually used to identify this member’s location in a particular data center or server rack. Tags are used for specifying granular write concern and read settings, and they’re discussed in section 8.4.4.
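
To make a few of these options concrete, here’s what a hidden, delayed backup member might look like in the members array (an illustrative sketch; the fourth member and its host name are hypothetical):

{
  _id: 3,
  host: "arete:40003",
  priority: 0,        // never eligible for election
  hidden: true,       // invisible to the drivers via isMaster
  slaveDelay: 1800,   // stays 30 minutes behind the primary
  buildIndexes: false // appropriate only for a pure backup node
}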

That sums up the options for individual replica set members. There are also two global replica set configuration parameters scoped under a settings key. In the replica set configuration document, they appear like this:

{
  settings: {
    getLastErrorDefaults: {w: 1},
    getLastErrorModes: {
      multiDC: { dc: 2 }
    }
  }
}

  • getLastErrorDefaults—A document specifying the default arguments to be used when the client calls getLastError with no arguments. This option should be treated with care because it’s also possible to set global defaults for getLastError within the drivers, and you can imagine a situation where application developers call getLastError not realizing that an administrator has specified a default on the server. For more details on getLastError, see section 3.2.3 on write concern. Briefly, to specify that all writes are replicated to at least two members with a timeout of 500 ms, you’d specify this value in the config like so: settings: { getLastErrorDefaults: {w: 2, wtimeout: 500} }.
  • getLastErrorModes—A document defining extra modes for the getLastError command. This feature is dependent on replica set tagging and is described in detail in section 8.4.4.

Replica set status

You can see the status of a replica set and its members by running the replSetGetStatus command. To invoke this command from the shell, run the rs.status() helper method. The resulting document indicates the extant members and their respective states, uptime, and oplog times. It’s important to understand replica set member state; you can see a complete list of possible values in table 8.1.

Table 8.1. Replica set states

State  State string  Notes

0 STARTUP Indicates that the replica set is negotiating with other nodes by pinging all set members and sharing config data.
1 PRIMARY This is the primary node. A replica set will always have at most one primary node.
2 SECONDARY This is a secondary, read-only node. This node may become a primary in the event of a failover if and only if its priority is greater than 0 and it’s not marked as hidden.
3 RECOVERING This node is unavailable for reading and writing. You usually see this state after a failover or upon adding a new node. While recovering, a data file sync is often in progress; you can verify this by examining the recovering node’s logs.
4 FATAL A network connection is still established, but the node isn’t responding to pings. This usually indicates a fatal error on the machine hosting the node marked FATAL.
5 STARTUP2 An initial data file sync is in progress.
6 UNKNOWN A network connection has yet to be made.
7 ARBITER This node is an arbiter.
8 DOWN The node was accessible and stable at some point but isn’t currently responding to heartbeat pings.
9 ROLLBACK A rollback is in progress.

You can consider a replica set stable and online when all its nodes are in any of states 1, 2, or 7 and when at least one node is running as the primary. You can use the replSetGetStatus command from an external script to monitor overall state, replication lag, and uptime, and this is recommended for production deployments.[8]

8 Note that in addition to running the status command, you can get a useful visual through the web console. Chapter 10 discusses the web console and shows an example of its use with replica sets.
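
As a minimal example of such a script (a sketch; adjust the host and port for your deployment), you can invoke the status command noninteractively using mongo’s --eval flag:

$ mongo arete:40000/admin --eval "
    db.runCommand({replSetGetStatus: 1}).members.forEach(
      function(m) { print(m.name + ' ' + m.stateStr + ' health=' + m.health) })"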

Failover and recovery

You saw in the sample replica set a couple of examples of failover. Here I summarize the rules of failover and provide some suggestions on handling recovery.

A replica set will come online when all members specified in the configuration can communicate with each other. Each node is given one vote by default, and those votes are used to form a majority and elect a primary. This means that a replica set can be started with as few as two nodes (and votes). But the initial number of votes also decides what constitutes a majority in the event of a failure.

Let’s assume that you’ve configured a replica set of three complete replicas (no arbiters) and thus have the recommended minimum for automated failover. If the primary fails, and the remaining secondaries can see each other, then a new primary can be elected. As for deciding which one, the secondary with the most up-to-date oplog (or higher priority) will be elected primary.

Failure modes and recovery

Recovery is the process of restoring the replica set to its original state following a failure. There are two overarching failure categories to be handled. The first comprises what are called clean failures, where a given node’s data files can still be assumed to be intact. One example of this is a network partition. If a node loses its connections to the rest of the set, then you need only wait for connectivity to be restored, and the partitioned node will resume as a set member. A similar situation occurs when a given node’s mongod process is terminated for any reason but can be brought back online cleanly.[9] Again, once the process is restarted, it can rejoin the set.

9 For instance, if MongoDB is shut down cleanly then you know that the data files are okay. Alternatively, if running with journaling, the MongoDB instance should be recoverable regardless of how it’s killed.

The second type of failure comprises all categorical failures, where either a node’s data files no longer exist or must be presumed corrupted. Unclean shutdowns of the mongod process without journaling enabled and hard drive crashes are both examples of this kind of failure. The only ways to recover a categorically failed node are to completely replace the data files via a resync or to restore from a recent backup. Let’s look at both strategies in turn.

To completely resync, start a mongod with an empty data directory on the failed node. As long as the host and port haven’t changed, the new mongod will rejoin the replica set and then resync all the existing data. If either the host or port has changed, then after bringing the mongod back online, you’ll also have to reconfigure the replica set. As an example, suppose the node at arete:40001 is rendered unrecoverable and you bring up a new node at foobar:40000. You can reconfigure the replica set by grabbing the configuration document, modifying the host for the second node, and then passing that to the rs.reconfig() method:

> use local
> config = db.system.replset.findOne()
{
  "_id" : "myapp",
  "version" : 1,
  "members" : [
    {
      "_id" : 0,
      "host" : "arete:30000"
    },
    {
      "_id" : 1,
      "host" : "arete:30001"
    },
    {
      "_id" : 2,
      "host" : "arete:30002"
    }
  ]
}
> config.members[1].host = "foobar:40000"
foobar:40000
> rs.reconfig(config)

Now the replica set will identify the new node, and the new node should start to sync from an existing member.

In addition to restoring via a complete resync, you also have the option of restoring from a recent backup. You’ll typically perform backups from one of the secondary nodes by making snapshots of the data files and then storing them offline.[10] Recovery via backup is possible only if the oplog within the backup isn’t stale relative to the oplogs of the current replica set members. This means that the latest operation in the backup’s oplog must still exist in the live oplogs. You can use the information provided by db.getReplicationInfo() to see right away if this is the case. When you do, don’t forget to take into account the time it’ll take to restore the backup. If the backup’s latest oplog entry is likely to go stale in the time it takes to copy the backup to a new machine, then you’re better off performing a complete resync.

10 Backups are discussed in detail in chapter 10.
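
Concretely, you can make this comparison from the shell (a sketch). On any live member, fetch the oldest oplog entry still available:

> db.getReplicationInfo().tFirst

If the backup’s tLast is earlier than this tFirst, the two oplogs no longer overlap, and a complete resync is your only option.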

But restoring from backup can be faster, in part because the indexes don’t have to be rebuilt from scratch. To restore from a backup, copy the backed-up data files to a mongod data path. The resync should begin automatically, and you can check the logs or run rs.status() to verify this.

Deployment strategies

You now know that a replica set can consist of up to 12 nodes, and you’ve been presented with a dizzying array of configuration options and considerations regarding failover and recovery. There are a lot of ways you might configure a replica set, but in this section I’ll present a couple that will work for the majority of cases.

The smallest replica set configuration providing automated failover is the one you built earlier, consisting of two replicas and one arbiter. In production, the arbiter can run on an application server while each replica gets its own machine. This configuration is economical and sufficient for many production apps.

But for applications where uptime is critical, you’ll want a replica set consisting of three complete replicas. What does the extra replica buy you? Think of the scenario where a single node fails completely. You still have two first-class nodes available while you restore the third. As long as a third node is online and recovering (which may take hours), the replica set can still fail over automatically to an up-to-date node.

Some applications will require the redundancy afforded by two data centers, and the three-member replica set can also work in this case. The trick is to use one of the data centers for disaster recovery only. Figure 8.2 shows an example of this. Here, the primary data center houses a replica set primary and secondary, and a backup data center keeps the remaining secondary as a passive node (with priority 0).

Figure 8.2. A three-node replica set with members in two data centers

In this configuration, the replica set primary will always be one of the two nodes living in data center A. You can lose any one node or any one data center and still keep the application online. Failover will usually be automatic, except in the cases where both of A’s nodes are lost. Because it’s rare to lose two nodes at once, this would likely represent the complete failure or partitioning of data center A. To recover quickly, you could shut down the member in data center B and restart it without the --replSet flag. Alternatively, you could start two new nodes in data center B and then force a replica set reconfiguration. You’re not supposed to reconfigure a replica set when a majority of nodes is unreachable, but you can do so in emergencies using the force option. For example, if you’ve defined a new configuration document, config, then you can force reconfiguration like so:

> rs.reconfig(config, {force: true})

As with any production system, testing is key. Make sure that you test for all the typical failover and recovery scenarios in a staging environment comparable to what you’ll be running in production. Knowing from experience how your replica set will behave in these failure cases will secure some peace of mind and give you the wherewithal to calmly deal with emergencies as they occur.

8.3. Master-slave replication

Master-slave replication is the original replication paradigm in MongoDB. This flavor of replication is easy to configure and has the advantage of supporting any number of slave nodes. But master-slave replication is no longer recommended for production deployments. There are a couple reasons for this. First, failover is completely manual. If the master node fails, then an administrator must shut down a slave and restart it as a master node. Then the application must be reconfigured to point to the new master. Second, recovery is difficult. Because the oplog exists only on the master node, a failure requires that a new oplog be created on the new master. This means that any other existing nodes will need to resync from the new master in the event of a failure.

In short, there are few compelling reasons to use master-slave replication. Replica sets are the way forward, and they’re the flavor of replication you should use.

8.4. Drivers and replication

If you’re building an application and using MongoDB’s replication, then you need to know about three application-specific topics. The first concerns connections and failover. Next comes write concern, which allows you to decide to what degree a given write should be replicated before the application continues. The final topic, read scaling, allows an application to distribute reads across replicas. I’ll discuss these topics one by one.

8.4.1. Connections and failover

The MongoDB drivers present a relatively uniform interface for connecting to replica sets.

Single-node connections

You’ll always have the option of connecting to a single node in a replica set. There’s no difference between connecting to a node designated as a replica set primary and connecting to one of the vanilla standalone nodes we’ve used for the examples throughout the book. In both cases, the driver will initiate a TCP socket connection and then run the isMaster command. This command then returns a document like the following:

{ "ismaster" : true, "maxBsonObjectSize" : 16777216, "ok" : 1 }

What’s most important to the driver is that the isMaster field be set to true, which indicates that the given node is either a standalone, a master running master-slave replication, or a replica set primary.[11] In all of these cases, the node can be written to, and the user of the driver can perform any CRUD operation.

11 The isMaster command also returns a value for the maximum BSON object size for this version of the server. The drivers then validate that all BSON objects are within this limit prior to inserting them.

But when connecting directly to a replica set secondary, you must indicate that you know you’re connecting to such a node (for most drivers, at least). In the Ruby driver, you accomplish this with the :slave_ok parameter. Thus to connect directly to the first secondary you created earlier in the chapter, the Ruby code would look like this:

@con = Mongo::Connection.new('arete', 40001, :slave_ok => true)

Without the :slave_ok argument, the driver will raise an exception indicating that it couldn’t connect to a primary node. This check is in place to keep you from inadvertently writing to a secondary node. Though such attempts to write will always be rejected by the server, you won’t see any exceptions unless you’re running the operations with safe mode enabled.

The assumption is that you’ll usually want to connect to a primary node; the :slave_ok parameter is enforced as a sanity check.

Replica set connections

You can connect to any replica set member individually, but you’ll normally want to connect to the replica set as a whole. This allows the driver to figure out which node is primary and, in the case of failover, reconnect to whichever node becomes the new primary.

Most of the officially supported drivers provide ways of connecting to a replica set. In Ruby, you connect by creating a new instance of ReplSetConnection, passing in a list of seed nodes:

Mongo::ReplSetConnection.new(['arete', 40000], ['arete', 40001])

Internally, the driver will attempt to connect to each seed node and then call the isMaster command. Issuing this command to a replica set returns a number of important set details:

> db.isMaster()
{
  "setName" : "myapp",
  "ismaster" : true,
  "secondary" : false,
  "hosts" : [
    "arete:40000",
    "arete:40001"
  ],
  "arbiters" : [
    "arete:40002"
  ],
  "maxBsonObjectSize" : 16777216,
  "ok" : 1
}

Once a seed node responds with this information, the driver has everything it needs. Now it can connect to the primary member, again verify that this member is still primary, and then allow the user to read and write through this node. The response object also allows the driver to cache the addresses of the remaining secondary and arbiter nodes. If an operation on the primary fails, then on subsequent requests, the driver can attempt to connect to one of the remaining nodes until it can reconnect to a primary.

It’s important to keep in mind that although replica set failover is automatic, the drivers don’t attempt to hide the fact that a failover has occurred. The course of events goes something like this: First, the primary fails or a new election takes place. Subsequent requests will reveal that the socket connection has been broken, and the driver will then raise a connection exception and close any open sockets to the database. It’s now up to the application developer to decide what happens next, and this decision will depend on both the operation being performed and the specific needs of the application.

Keeping in mind that the driver will automatically attempt to reconnect on any subsequent request, let’s imagine a couple of scenarios. First, suppose that you only issue reads to the database. In this case, there’s little harm in retrying a failed read since there’s no possibility of changing database state. But now imagine that you also regularly write to the database. As stated many times before, you can write to the database with and without safe mode enabled. With safe mode, the driver appends to each write a call to the getlasterror command. This ensures that the write has arrived safely and reports any server errors back to the application. Without safe mode, the driver simply writes to the TCP socket.

If your application writes without safe mode and a failover occurs, then you’re left in an uncertain state. How many of the recent writes made it to the server? How many were lost in the socket buffer? The indeterminate nature of writing to a TCP socket makes answering these questions practically impossible. How big of a problem this is depends on the application. For logging, non-safe-mode writes are probably acceptable, since losing writes hardly changes the overall logging picture; but for users creating data in the application, non-safe-mode writes can be a disaster.

With safe mode enabled, only the most recent write is in question; it may have arrived on the server, or it may not have. At times it’ll be appropriate to retry the write, and at other times an application error should be thrown. The drivers will always raise an exception; developers can then decide how these exceptions are handled.
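As one illustration, here's a hedged retry sketch. Mongo::ConnectionFailure is the exception the Ruby driver raises when its connection to the set is broken; the retry count and sleep interval are arbitrary choices for this example.

def insert_with_retry(collection, doc, attempts=3)
  attempts.times do |n|
    begin
      # Safe mode ensures we learn whether this particular write succeeded.
      return collection.insert(doc, :safe => true)
    rescue Mongo::ConnectionFailure
      raise if n == attempts - 1
      sleep(2)  # give the replica set time to elect a new primary
    end
  end
end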

In any case, retrying an operation will cause the driver to attempt to reconnect to the replica set. But since drivers will differ somewhat in their replica set connection behavior, you should always consult your driver’s documentation for specifics.

8.4.2. Write concern

It should be clear by now that running in safe mode by default is reasonable for most applications, as it's important to know that writes have arrived error-free at the primary server. But greater levels of assurance are frequently desired, and write concern addresses this by letting developers specify how far a write should be replicated before the application continues. Technically, you control write concern via two parameters on the getlasterror command: w and wtimeout.

The first value, w, admits a few possible values, but usually indicates the total number of servers that the latest write should be replicated to; the second is a timeout that causes the command to return an error if the write can’t be replicated in the specified number of milliseconds.

For example, if you want to make sure that a given write is replicated to at least one secondary in addition to the primary, you can indicate a w value of 2 (the primary plus one secondary). If you want the operation to time out if this level of replication isn't achieved in 500 ms, you include a wtimeout of 500. Note that if you don't specify a value for wtimeout and the replication for some reason never occurs, the operation will block indefinitely.
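At the command level, this amounts to parameters on getlasterror itself. From the shell, the equivalent of the scenario just described looks like this (a sketch; the command reports on whichever write this connection issued last):

> db.runCommand({getlasterror: 1, w: 2, wtimeout: 500})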

When using a driver, you enable write concern not by calling getLastError explicitly but rather by creating a write concern object or by setting the appropriate safe mode option; it depends on the specific driver’s API.[12] In Ruby, you can specify a write concern on a single operation like so:

12 There are examples of setting write concern in Java, PHP, and C++ in appendix D.

@collection.insert(doc, :safe => {:w => 2, :wtimeout => 200})

Sometimes you simply want to ensure that a write is replicated to a majority of available nodes. For this case, you can set a w value of majority:

@collection.insert(doc, :safe => {:w => "majority"})

Even fancier options exist. For instance, if you've enabled journaling, then you can also require that the write be committed to the journal before returning, by adding the j option:

@collection.insert(doc, :safe => {:w => 2, :j => true})

Many drivers also support setting default write concern values for a given connection or database; in the Ruby driver, for instance, a connection-wide default can be set when the connection is created, as in the sketch below. To find out how to set write concern in your particular case, check your driver's documentation. A few more language examples can be found in appendix D.
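This is a sketch under the assumption that your driver version accepts a :safe option at connection creation; per-operation :safe options still override the default.

@con = Mongo::ReplSetConnection.new(['arete', 40000], ['arete', 40001],
         :safe => {:w => 2, :wtimeout => 200})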

Write concern works with both replica sets and master-slave replication. If you examine the local database on each node, you'll see a couple of collections used to implement write concern: me on secondary nodes and slaves on the primary node. Whenever a secondary polls a primary, the primary makes a note of the latest oplog entry applied to each secondary in its slaves collection. Thus the primary knows what each secondary has replicated at all times and can therefore reliably answer the getlasterror command's questions about replication progress.
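You can peek at this bookkeeping yourself from a shell connected to the primary. This is a sketch; the exact fields vary by server version, but each document records a syncedTo optime for one secondary.

> use local
> db.slaves.find()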

Keep in mind that using write concern with values of w greater than 1 will introduce extra latency. Configurable write concern essentially allows you to make the trade-off between speed and durability. If you're running with journaling, then a write concern with w equal to 1 should be fine for most applications. On the other hand, for logging or analytics, you might elect to disable journaling and write concern altogether and rely solely on replication for durability, accepting that you may lose some writes in the event of a failure. Consider these trade-offs carefully and test the different scenarios when designing your application.

8.4.3. Read scaling

Replicated databases are great for read scaling. If a single server can’t handle the application’s read load, then you have the option to route queries to more than one replica. Most of the drivers have built-in support for sending queries to secondary nodes. With the Ruby driver, this is provided as an option on the ReplSetConnection constructor:

Mongo::ReplSetConnection.new(['arete', 40000], ['arete', 40001],
  :read => :secondary)

When the :read argument is set to :secondary, the connection object will choose a random, nearby secondary to read from.

Other drivers can be configured to read from secondaries by setting a slaveOk option. When the Java driver is connected to a replica set, setting slaveOk to true will enable secondary load balancing on a per-thread basis. The load-balancing implementations found in the drivers are designed to be generally applicable, so they may not work for all apps; when that's the case, users frequently write their own. As usual, consult your driver's documentation for specifics.

Many MongoDB users scale with replication in production. But there are three cases where this sort of scaling won’t be sufficient. The first concerns the number of servers needed. As of MongoDB v2.0, replica sets support a maximum of 12 members, 7 of which can vote. If you need even more replicas for scaling, you can use master-slave replication. But if you don’t want to sacrifice automated failover and you need to scale beyond the replica set maximum, then you’ll need to migrate to a sharded cluster.

The second case involves applications with a high write load. As mentioned at the beginning of the chapter, secondaries must keep up with this write load. Sending reads to write-laden secondaries may inhibit replication.

A third situation that replica scaling can't handle is consistent reads. Because replication is asynchronous, replicas won't always reflect the latest writes to the primary. Therefore, if your application reads arbitrarily from secondaries, the picture presented to end users isn't always guaranteed to be fully consistent. For applications whose main purpose is to display content, this almost never presents a problem. But other apps, where users are actively manipulating data, will require consistent reads. In these cases, you have two options. The first is to separate the parts of the application that need consistent reads from the parts that don't: the former can always read from the primary, and the latter can be distributed to secondaries (a brief sketch follows). The second, for when this strategy is either too complicated or simply doesn't scale, is sharding.[13]

13 Note that to get consistent reads from a sharded cluster, you must always read from the primary nodes of each shard, and you must issue safe writes.
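To make the first option concrete, here's a hedged sketch that keeps two connections to the same set: one for reads that must be consistent and one for display-only reads. The hostnames are reused from earlier examples; the database, collection, and variable names (including user_id) are hypothetical.

@consistent = Mongo::ReplSetConnection.new(['arete', 40000], ['arete', 40001])
@eventual   = Mongo::ReplSetConnection.new(['arete', 40000], ['arete', 40001],
                :read => :secondary)

# Data the user is actively editing reads from the primary...
profile = @consistent['myapp']['users'].find_one('_id' => user_id)

# ...while public, display-only content can tolerate slightly stale reads.
feed = @eventual['myapp']['events'].find('public' => true).to_a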

8.4.4. Tagging

If you’re using either write concern or read scaling, you may find yourself wanting more granular control over exactly which secondaries receive writes or reads. For example, suppose you’ve deployed a five-node replica set across two data centers, NY and FR. The primary data center, NY, contains three nodes, and the secondary data center, FR, contains the remaining two. Let’s say that you want to use write concern to block until a certain write has been replicated to at least one node in data center FR. With what you know about write concern right now, you’ll see that there’s no good way to do this. You can’t use a w value of majority, since this will translate into a value of 3, and the most likely scenario is that the three nodes in NY will acknowledge first. You could use a value of 4, but this won’t hold up well if, say, you lose one node from each data center.

Replica set tagging solves this problem by allowing you to define special write concern modes that target replica set members with certain tags. To see how this works, you first need to learn how to tag a replica set member. In the config document, each member can have a key called tags pointing to an object containing key-value pairs. Here’s an example:

{
  "_id" : "myapp",
  "version" : 1,
  "members" : [
    {
      "_id" : 0,
      "host" : "ny1.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "A" }
    },
    {
      "_id" : 1,
      "host" : "ny2.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "A" }
    },
    {
      "_id" : 2,
      "host" : "ny3.myapp.com:30000",
      "tags": { "dc": "NY", "rackNY": "B" }
    },
    {
      "_id" : 3,
      "host" : "fr1.myapp.com:30000",
      "tags": { "dc": "FR", "rackFR": "A" }
    },
    {
      "_id" : 4,
      "host" : "fr2.myapp.com:30000",
      "tags": { "dc": "FR", "rackFR": "B" }
    }
  ],

  "settings": {
    "getLastErrorModes": {
      "multiDC": { "dc": 2 },
      "multiRack": { "rackNY": 2 }
    }
  }
}

This is a tagged configuration document for the hypothetical replica set spanning two data centers. Note that each member’s tag document has two key-value pairs: the first identifies the data center and the second names the local server rack for the given node. Keep in mind that the names used here are completely arbitrary and only meaningful in the context of this application; you can put anything in a tag document. What’s most important is how you use it.

That’s where getLastErrorModes come into play. These allow you to define modes for the getLastError command that implement specific write concern requirements. In the example, you’ve defined two of these. The first, multiDC, is defined as { "dc": 2}, which indicates that a write should be replicated to nodes tagged with at least two different values for dc. If you examine the tags, you’ll see this will necessarily ensure that the write has been propagated to both data centers. The second mode specifies that at least two server racks in NY should have received the write. Again the tags should make this clear.

In general, a getLastErrorModes entry consists of a document with one or more keys (in this case, dc and rackNY) whose values are integers. These integers indicate the number of distinct values for the given tag that must confirm a write before the getLastError command can complete successfully. Once you've defined these modes, you can use them as values for w in your application. For example, using the first mode in Ruby looks like this:

@collection.insert(doc, :safe => {:w => "multiDC"})

In addition to making write concern more sophisticated, tagging promises to provide more granular control over which replicas are used for read scaling. Unfortunately, at the time of this writing, the semantics for reading against tags haven’t been defined or implemented in the official MongoDB drivers. For updates, follow the issue for the Ruby driver at https://jira.mongodb.org/browse/RUBY-326.

8.5. Summary

It should be clear from all that we’ve discussed that replication is incredibly useful and that it’s also essential for most deployments. MongoDB’s replication is supposed to be easy, and setting it up usually is. But when it comes to backing up and failing over, there are bound to be hidden complexities. For these complex cases, let experience, and some help from this chapter, breed familiarity.