How does CouchDB work?
So how does CouchDB get its fancy features? Mostly by not being very much like a conventional database to use. (It does ultimately provide many of the same features of conventional databases, just in slightly different ways.) Instead of manipulating the database with SQL, you give it: (1) an array of documents expressed as JSON objects; and (2) some Javascript code in the form of two functions
map() and
reduce() (described below) that together convert the array of documents into what can be thought of as a
single hash/associative array. CouchDB then makes it possible to efficiently search this hash by key.
Even with multiple "index" hashes (created by multiple map reduce functions), only being able to "query" your database by searching through a hash by key (no substring matching! no boolean expressions!) may seem limiting. As it turns out, though, many of the queries you'd otherwise do in SQL are able to be expressed in this way. For example, suppose the "documents" in your database store information about each hit to your website. (This sort of data set isn't the best application for CouchDB, but it's what I've been playing with.) The array of documents might look like this:
[
{
url: "http://www.ibuildings.com/",
referrer: "http://www.google.com/search?q=ibuildings",
gmtime: "2008-09-04 11:23:32",
userid: "42b3a5f2"
},
{
url: "http://www.ibuildings.com/contact/",
referrer: "http://www.ibuildings.com/",
gmtime: "2008-09-04 11:24:01",
userid: "42b3a5f2"
},
{
url: "http://www.ibuildings.com/",
referrer: "",
gmtime: "2008-09-04 11:27:13",
userid: "7732ba3f"
},
// ...
]
The total number of hits to each URL could be made available in a hash like:
{
"http://www.ibuildings.com/": 743,
"http://www.ibuildings.com/about/": 16,
"http://www.ibuildings.com/contact/": 43,
// ...
}
Or the referrers to each URL in the last month:
{
"http://www.ibuildings.com/": {
"http://www.example.com/": 3,
"http://www.google.com/search?q=ibuildings": 5,
"http://www.foo.com/": 1,
// ...
},
"http://www.ibuildings.com/contact/": {
"http://www.ibuildings.com/": 1,
// ...
},
// ...
}
The pages visited by user id "42b3a5f2", and the times of visit:
{
"42b3a5f2": {
"http://www.ibuildings.com/": "2008-09-04 11:23:32",
"http://www.ibuildings.com/contact/": "2008-09-04 11:24:01"
},
"7732ba3f": {
"http://www.ibuildings.com/": "2008-09-04 11:27:13"
},
// ...
}
The examples above show how hashes might store the information we want, but how are they created in a way that scales readily across multiple CPUs in the first place? This comes from the cunning of MapReduce.
The MapReduce functions
A
map function takes a single document as input, and returns an array (an empty array is possible) of key-value pairs as output. The map function must be
referentially transparent, which is to say that given the same input, it must produce exactly the same output. (So among other things, the result can't vary according to the time of day, a random number or the value of a changing global variable.)
A
reduce function takes an array as input, and returns a
single value as output. (The value can be a complex type, such as an array or hash.) The input array is made up of values that are either: (a) the values from key-value pairs (as produced by the map function) that happen to have exactly the same key; or (b) the return value of a reduce function. The reduce function needs to be both associative and commutative, which means that it must produce the same result if the input array is randomly shuffled and also if part of the array has already been passed to the reduce function.
That is, the following must all produce exactly the same result:
reduce($v1, $v2, $v3, $v4);
reduce($v2, $v1, $v3, $v4);
reduce(reduce($v3, $v4), $v1, $v2);
reduce(reduce($v4), reduce($v2, reduce($v1, $v3)));
It may seem difficult to write a function that satisfies the latter constraint (and often it is), but such functions are more common than you might think: if $v1, $v2, $v3, $v3 are integers, for example, then a reduce function that simply returns the sum of its arguments is both associative and commutative. (Multiplication is also both associative and commutative.)
An example: calculating the total number of hits to each URL
Each one of the hashes shown above can be produced by a MapReduce combination. For the total number of hits to each URL, you need the
map function:
function(doc) {
emit(doc.url, 1);
}
(Note that CouchDB map functions use
emit() to "return" a value instead of returning an array of values as described above. Conceptually, though, this amounts to the same thing. Also note that multiple
emit() calls are allowed.)
And a
reduce function:
function(keys, values) {
var total = 0;
for (var i = 0; i < values.length; i++) {
total += values[i];
}
return total;
}
The
emit(doc.url, 1) in the
map function looks weird , and would seem to produce output of little use. The output can be thought of as a table with one column of URLs, and another column with the constant value of 1.
| Key | Value |
|---|
| http://www.ibuildings.com/ | 1 |
| http://www.ibuildings.com/ | 1 |
| http://www.ibuildings.com/contact/ | 1 |
| http://www.ibuildings.com/ | 1 |
| ... |
CouchDB converts this into another table with two columns: a column of unique URLs, and a column of values that correspond to the same URL.
| URL | Value |
|---|
| http://www.ibuildings.com/ | [ 1, 1, 1 ] |
| http://www.ibuildings.com/contact/ | [ 1 ] |
| ... |
The values of this table can be passed to the reduce function, whose return values CouchDB assembles in turn into a third table.
| URL | Value |
|---|
| http://www.ibuildings.com/ | 3 |
| http://www.ibuildings.com/contact/ | 1 |
| ... |
This table is almost identical to the "total number of hits to each URL" hash given above.
(Note that to save memory and to enable it to scale across CPUs, CouchDB almost certainly won't create each of these tables in full: due to the properties of the reduce function, passing in partial results and having CouchDB merge the output together produces the same answer as passing in full results in the first place.)
Examples of other MapReduce functions
Count of URLs by date
Map:
function(doc) {
emit(doc.gmtime.substr(0, 4), 1);
emit(doc.gmtime.substr(0, 7), 1);
emit(doc.gmtime.substr(0, 10), 1);
}
Reduce:
function (keys, values) {
var total = 0;
for (var i = 0; i < values.length; i++) {
total += values[i];
}
return total;
}
Result (a hash with dates as keys, and count as values):
$ curl http://127.0.0.1:5984/mydatabase/_view/myview/count_by_date?key=%222008-07-14%22
{"rows":[{"key":null,"value":690}]}
Note that the value of the key must be JSON-encoded, even if it's a string. This took me longer than it should've to figure out!
The resultant MapReduce hash has keys for years (
2008), months (
2008-07) and days (
2008-07-14).
Referrers to a URL
Map:
function(doc) {
if (doc.url && doc.referrer) {
var referrer = doc.referrer;
if (referrer.indexOf("?") != -1) {
referrer = referrer.substr(0, referrer.indexOf("?"));
}
emit(doc.url, referrer);
}
}
Reduce:
function (keys, values) {
var hash = { };
for (var i = 0; i < values.length; i++) {
var v = values[i];
// v could be a string or a hash, depending on whether
// it's the return value of the map() function or the
// return value of a previous reduce().
switch (typeof v) {
case 'string':
if (!hash[v]) hash[v] = 0;
hash[v] += 1;
break;
case 'object':
for (k in v) {
if (!hash[k]) hash[k] = 0;
hash[k] += v[k];
}
break;
}
}
return hash;
}
Result (a hash with URLs as keys, and a hash of referring URLs with counts as values):
$ curl http://127.0.0.1:5984/mydatabase/_view/myview/url_referrer?key=%22http://www.ibuildings.com/contact/%22
{"rows":[{"key":null,"value":{"http:\/\/www.google.com\/search":2,"http:\/\/search.yahoo.com\/search":3,"http:\/\/www.ibuildings.com\/":4}}]}
Note: This MapReduce pair isn't the recommended way to generate this result; for
performance reasons you should
emit() [ url, referrer ] pairs, use the
group=true argument to the reduce function, and count the results on the client side. (This seems to be an implementation artifact; the original
MapReduce paper doesn't mention this limitation.)
Other comments and notes
- All interaction with CouchDB is done over HTTP, via a REST-style interface, and it comes with a nice web-based interface called Futon. Futon is sufficient for browsing the database and creating "views" (i.e. the hash that results from applying a MapReduce pair) but unfortunately, you can only add items one at a time; there's no way to do a bulk upload of data, and the CouchDB documentation is not very helpful on this point. This is how I got some test data into CouchDB:
- Create the database, e.g. mydatabase, with Futon. Or use
curl:
$ curl -X PUT http://127.0.0.1:5984/mydatabase/
{"ok":true}
- POST "{}" (i.e. a JSON-encoded empty hash) to http://127.0.0.1:5984/mydatabase/. You'll get back a JSON-encoded hash with
id and rev keys defined; remember these.
$ curl -X POST -d "{}" http://127.0.0.1:5984/mydatabase/
{"ok":true,"id":"90f88c98fc380b08ad81e9471d940afb","rev":"3576644701"}
The id uniquely identifies the document.
- PUT a JSON-encoded hash (with whatever key-value pairs you like, and with
_id and _rev key-value pairs added) to the URL http://127.0.0.1:5984/mydatabase/90f88c98fc380b08ad81e9471d940afb, where the hash at the end is the value of the _id key.
$ curl -X PUT -d '{"_id": "90f88c98fc380b08ad81e9471d940afb", "_rev": "3576644701", \
"name": "Michael", "email": "michael@ibuildings.com"}' \
http://127.0.0.1:5984/mydatabase/90f88c98fc380b08ad81e9471d940afb
{"ok":true,"id":"90f88c98fc380b08ad81e9471d940afb","rev":"3176158736"}
I wrote some quick Perl code to do bulk inserts but it's really horrible so you're probably best off writing your own...
- If you're on OS X, save yourself the trouble of building CouchDB and all its dependencies yourself (at time of writing the MacPorts version doesn't work) and download CouchDBX. There's no install required and the whole thing is a standard OS X application with a small GUI for starting and stopping CouchDB and launching Futon.
- The outline of how CouchDB works above probably gives the impression that the keys of the resultant MapReduce hash are limited to strings, the way they are in many programming languages, including Javascript and PHP. However, in CouchDB, keys as well as values can be arbitrary Javascript objects. In addition, the MapReduce hash is sorted by key value, and there's a well-defined sort order, which helps ameliorate the problem of sorting results, which often can't be done within CouchDB. (You get "ORDER BY" on keys for free, but you can't sort the hash's values.)
- It's not possible to perform a MapReduce operation on the result of a MapReduce operation. Conceptually and architecturally this is no problem, and other similar projects like Disco allow this, but it hasn't been implemented in CouchDB yet.
- Debugging views (i.e. the MapReduce functions) is rather difficult. In Futon, at least, even a syntax error yields a very puzzling message. This is what you get if you forget a closing curly bracket, for example:

Once you have a syntactically correct function you can use the Javascript function log() (which directs output to couch.log) for printf-style debugging.
- Views can take a surprisingly long time to run. Querying them is quick, but producing them in the first place can easily take 10 or 20 times as long as adding an index to an SQLite database, say. It also seems fairly easy to mess up and accidentally create a view that is many times the size of the original data set.
- CouchDB is written in Erlang which evidently gives it a lot of fault tolerance and distributed computing features for free. The language itself is impressively concise (as well as looking rather odd): CouchDB 0.8.1 is only about 6500 lines of Erlang code (excluding the Javascript interpreter, which is written in C).