Appendix C. Binary data and GridFS – MongoDB in Action


For storing images, thumbnails, audio, and other binary files, many applications rely solely on the file system. Although file systems provide fast access to files, file system storage can also lead to organizational chaos. Consider that most file systems limit the number of files per directory. If you have millions of files to keep track of, you need to devise a strategy for organizing them into multiple directories. Another difficulty involves metadata: because file metadata typically lives in a database while the files themselves live on the file system, performing an accurate, consistent backup of the files and their metadata can be incredibly complicated.

For certain use cases, it may make sense to store files in the database itself because it simplifies file organization and backup. In MongoDB, you can use the BSON binary type to store any kind of binary data. This data type corresponds to the RDBMS BLOB (binary large object) type, and it’s the basis for two flavors of binary object storage provided by MongoDB.

The first uses one document per file and is best for smaller binary objects. If you need to catalog a large number of thumbnails or MD5 checksums, then single-document binary storage can make life much easier. On the other hand, you might want to store large images or audio files. In that case, GridFS, a MongoDB API for storing binary objects of any size, is the better choice. In the next two sections, you'll see complete examples of both storage techniques.

C.1. Simple binary storage

BSON includes a first-class type for binary data. You can use this type to store binary objects directly inside MongoDB documents. The only limit on object size is the document size limit itself, which is 16 MB as of MongoDB v2.0. Because large documents like these can tax system resources, you're encouraged to use GridFS for any binary objects you want to store that are larger than 1 MB.

We’ll look at two reasonable uses of binary object storage in single documents. First, you’ll see how to store an image thumbnail. Then, you’ll see how to store the accompanying MD5.

C.1.1. Storing a thumbnail

Imagine you need to store a collection of image thumbnails. The code is straightforward. First, you get the image’s filename, canyon-thumb.jpg, and then read the data into a local variable. Next, you wrap the raw binary data as a BSON binary object using the Ruby driver’s BSON::Binary constructor:

require 'rubygems'
require 'mongo'

image_filename = File.join(File.dirname(__FILE__), "canyon-thumb.jpg")
image_data = File.open(image_filename, "rb").read  # binary mode avoids encoding and newline conversions

bson_image_data = BSON::Binary.new(image_data)

All that remains is to build a simple document to contain the binary data and then insert it into the database:

doc = {"name" => "canyon-thumb.jpg",
       "data" => bson_image_data }

@con = Mongo::Connection.new
@thumbnails = @con['images']['thumbnails']
@image_id = @thumbnails.insert(doc)

To extract the binary data, fetch the document. In Ruby, the to_s method unpacks the data into a binary string, and you can use this to compare the saved data to the original:

doc = @thumbnails.find_one({"_id" => @image_id})
if image_data == doc["data"].to_s
  puts "Stored image is equal to the original file!"
end

If you run the preceding script, you’ll see a message indicating that the two files are indeed the same.

C.1.2. Storing an MD5

It’s common to store a checksum as binary data, and this marks another potential use of the BSON binary type. Here’s how you can generate an MD5 of the thumbnail and add it to the document just stored:

require 'digest/md5'
md5 = Digest::MD5.file(image_filename).digest
bson_md5 = BSON::Binary.new(md5, BSON::Binary::SUBTYPE_MD5)

@thumbnails.update({:_id => @image_id}, {"$set" => {:md5 => bson_md5}})

Note that when creating the BSON binary object, you tag the data with SUBTYPE_MD5. The subtype is an extra field on the BSON binary type that indicates what kind of binary data is being stored. However, this field is entirely optional and has no effect on how the database stores or interprets the data.[1]

1 This wasn’t always technically true. The deprecated default subtype of 2 indicated that the attached binary data also included four extra bytes to indicate the size, and this did affect a few database commands. The current default subtype is 0, and all subtypes now store the binary payload the same way. Subtype can therefore be seen as a kind of lightweight tag to be optionally used by application developers.

It’s easy to query for the document just stored, but do notice that you exclude the data field to keep the return document small and readable:

> use images
> db.thumbnails.findOne({}, {data: 0})
{
  "_id" : ObjectId("4d608614238d3b4ade000001"),
  "md5" : BinData(5,"K1ud3EUjT49wdMdkOGjbDg=="),
  "name" : "canyon-thumb.jpg"
}

Notice that the md5 field is clearly marked as binary data, showing both the subtype (5) and the raw payload.

C.2. GridFS

GridFS is a convention for storing files of arbitrary size in MongoDB. The GridFS specification is implemented by all of the official drivers and by MongoDB's mongofiles tool, ensuring consistent access across platforms. GridFS is useful for storing large binary objects in the database. It's frequently fast enough to serve these objects as well, and the storage method is conducive to streaming.

The term GridFS frequently leads to confusion, so two clarifications are worth making right off the bat. The first is that GridFS isn’t an intrinsic feature of MongoDB. As mentioned, it’s a convention that all the official drivers (and some tools) use to manage large binary objects in the database. Second, it’s important to clarify that GridFS doesn’t have the rich semantics of bona fide file systems. For instance, there’s no protocol for locking and concurrency, and this limits the GridFS interface to simple put, get, and delete operations. This means that if you want to update a file, you need to delete it and then put the new version.
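The replace-instead-of-update semantics can be illustrated with a toy in-memory model of the GridFS interface. ToyGrid here is purely hypothetical, not part of any driver; it only mimics the put, get, and delete operations to show why an "update" is a delete followed by a put:

```ruby
# A toy in-memory stand-in for the GridFS interface (hypothetical,
# not the real driver API). It supports only put, get, and delete.
class ToyGrid
  def initialize
    @files = {}
    @next_id = 0
  end

  # Store data under a new id and return that id, as Grid#put does.
  def put(data, filename)
    id = (@next_id += 1)
    @files[id] = { :filename => filename, :data => data }
    id
  end

  def get(id)
    @files[id]
  end

  def delete(id)
    @files.delete(id)
  end
end

grid = ToyGrid.new
old_id = grid.put("version 1", "canyon.jpg")

# There's no in-place update: to change a file, delete the old
# version and put the new one, which yields a new id.
grid.delete(old_id)
new_id = grid.put("version 2", "canyon.jpg")
```

Note that the replacement gets a fresh id, so any references to the old file id must be updated by the application.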

GridFS works by dividing a large file into small, 256 KB chunks and then storing each chunk as a separate document. By default, these chunks are stored in a collection called fs.chunks. Once the chunks are written, the file’s metadata is stored in a single document in another collection called fs.files. Figure C.1 contains a simplistic illustration of this process applied to a theoretical 1 MB file called canyon.jpg.

Figure C.1. Storing a file with GridFS
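The chunking step can be sketched in plain Ruby. The chunk_documents method below is a hypothetical helper, not the driver's actual implementation, but the documents it builds mirror the shape of the chunk documents stored in fs.chunks:

```ruby
CHUNK_SIZE = 256 * 1024  # GridFS default chunk size: 262,144 bytes

# Split binary data into GridFS-style chunk documents (a sketch of
# the convention, not the driver's real code). Each chunk records
# its owning file's id and its position n within the file.
def chunk_documents(data, files_id)
  chunks = []
  n = 0
  while n * CHUNK_SIZE < data.bytesize
    chunks << {
      "files_id" => files_id,
      "n"        => n,
      "data"     => data.byteslice(n * CHUNK_SIZE, CHUNK_SIZE)
    }
    n += 1
  end
  chunks
end

# A 1 MB file divides evenly into four 256 KB chunks.
docs = chunk_documents("x" * (1024 * 1024), "some-file-id")
```

Once all chunks are written, a single metadata document describing the file is written to fs.files.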

That should be enough theory to use GridFS. Next we’ll see GridFS in practice through the Ruby GridFS API and the mongofiles utility.

C.2.1. GridFS in Ruby

Earlier you stored a small image thumbnail. The thumbnail took up only 10 KB and was thus ideal for keeping in a single document. The original image is almost 2 MB in size, and is therefore much more appropriate for GridFS storage. Here you’ll store the original using Ruby’s GridFS API. First, you connect to the database and then initialize a Grid object, which takes a reference to the database where the GridFS file will be stored.

Next, you open the original image file, canyon.jpg, for reading. The most basic GridFS interface uses methods to put and get a file. Here you use the Grid#put method, which takes either a string of binary data or an IO object, such as a file pointer. You pass in the file pointer and the data is written to the database.

The method returns the file’s unique object ID:

@con  = Mongo::Connection.new
@db   = @con["images"]

@grid = Mongo::Grid.new(@db)

filename = File.join(File.dirname(__FILE__), "canyon.jpg")
file = File.open(filename, "rb")

file_id = @grid.put(file, :filename => "canyon.jpg")

As stated, GridFS uses two collections for storing file data. The first, normally called fs.files, keeps each file’s metadata. The second collection, fs.chunks, stores one or more chunks of binary data for each file. Let’s briefly examine these from the shell.

Switch to the images database, and query for the first entry in the fs.files collection. You’ll see the metadata for the file you just stored:

> use images
> db.fs.files.findOne()
{
  "_id" : ObjectId("4d606588238d3b4471000001"),
  "filename" : "canyon.jpg",
  "contentType" : "binary/octet-stream",
  "length" : 2004828,
  "chunkSize" : 262144,
  "uploadDate" : ISODate("2011-02-20T00:51:21.191Z"),
  "md5" : "9725ad463b646ccbd287be87cb9b1f6e"
}

These are the minimum required attributes for every GridFS file. Most are self-explanatory. You can see that this file is about 2 MB and is divided into chunks 256 KB in size. You’ll also notice an MD5. The GridFS spec requires a checksum to ensure that the stored file is the same as the original.
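You can sanity-check the number of chunks from the length and chunkSize attributes alone, a quick back-of-the-envelope calculation in Ruby:

```ruby
# Expected chunk count: the file length divided by the chunk size,
# rounded up. Values are taken from the fs.files document above.
length     = 2004828
chunk_size = 262144

expected_chunks = (length.to_f / chunk_size).ceil
puts expected_chunks  # => 8
```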

Each chunk stores the object ID of its file in a field called files_id. Thus you can easily count the number of chunks this file uses:

> db.fs.chunks.count({"files_id" : ObjectId("4d606588238d3b4471000001")})
8

Given the chunk size and the total file size, eight chunks is exactly what you should expect. The contents of the chunks themselves are easy to see, too. As before, you'll want to exclude the data field to keep the output readable. This query returns the first of the eight chunks, as indicated by the value of n:

> db.fs.chunks.findOne({files_id: ObjectId("4d606588238d3b4471000001")},
          {data: 0})
{
  "_id" : ObjectId("4d606588238d3b4471000002"),
  "n" : 0,
  "files_id" : ObjectId("4d606588238d3b4471000001")
}

Reading GridFS files is as easy as writing them. In the following example, you use Grid#get to return an IO-like GridIO object representing the file. You can then stream the GridFS file back to the file system. Here, you read 256 KB at a time to write a copy of the original file:

image_io = @grid.get(file_id)

copy_filename = File.join(File.dirname(__FILE__), "canyon-copy.jpg")
copy = File.open(copy_filename, "wb")

while !image_io.eof? do
  copy.write(image_io.read(256 * 1024))
end

copy.close

You can then verify for yourself that both files are the same:[2]

2 This code assumes that you have the diff utility installed.

$ diff -s canyon.jpg canyon-copy.jpg
Files canyon.jpg and canyon-copy.jpg are identical

Those are the basics of reading and writing GridFS files from a driver. The various GridFS APIs differ slightly, but with the foregoing examples and a basic knowledge of how GridFS works, you should have no trouble making sense of your driver's docs.

C.2.2. GridFS with mongofiles

The MongoDB distribution includes a handy utility called mongofiles for listing, putting, getting, and deleting GridFS files using the command line. For example, you can list the GridFS files in the images database:

$ mongofiles -d images list
connected to: 127.0.0.1
canyon.jpg 2004828

You can also easily add files. Here’s how you can add the copy of the image that you wrote with the Ruby script:

$ mongofiles -d images put canyon-copy.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('4d61783326758d4e6727228f'),
              filename: "canyon-copy.jpg",
              chunkSize: 262144, uploadDate: new Date(1298233395296),
              md5: "9725ad463b646ccbd287be87cb9b1f6e", length: 2004828 }

You can again list the files to verify that the copy was written:

$ mongofiles -d images list
connected to: 127.0.0.1
canyon.jpg 2004828
canyon-copy.jpg 2004828

mongofiles supports a number of options, and you can view them with the --help parameter:

$ mongofiles --help