Christopher Maier's Technical Blog

Creating a Query DSL Using Clojure and MongoDB


One of the nice things about MongoDB (particularly when using it in Clojure via the Congomongo library) is that its map-based query language is so amenable to the creation of a domain-specific language, or DSL. Creating and manipulating maps is like breathing in Clojure, so it is trivial to decompose the different query requirements of your application into a small collection of simple functions that can be used to create a rather fluent DSL. The data-structure-based query language of MongoDB makes this possible (or at least easier; it would be much more difficult to do in a string-based language like SQL).

Not only does creating a DSL make querying easy (particularly with complex conditions), but it also insulates your application from change in a few important ways. Especially in the initial, exploratory stages of a project, it is common to change and evolve a data schema, and NoSQL environments make this very simple. Using a DSL will shield your code from these changes; you only need to change the DSL “atoms” that the schema change affects.

Another benefit is that you can more easily change out your underlying database when and if the need arises. With SQL databases, this is not as big of a problem. SQL is a standard, and we have things like JDBC to provide (more or less) equivalent interaction with SQL databases (yes, reality is more complicated, but we’re comparing to swapping out one NoSQL database for another). There is no corresponding “NoSQL standard”, but even if there were, there are so many different kinds of NoSQL databases (document, graph, key-value, column store, etc.) that there probably can’t be any sort of meaningful general abstraction like JDBC that covers them all. However, when you create a query DSL, you don’t need to create a completely general abstraction over your underlying database; you just need one that works for your project.

I recently implemented a simple DSL for a project at work that we use for querying complex proteomics and genomics data. I’ll illustrate a small bit of the DSL here to describe the general approach and show some of the benefits.

Background

In a nutshell, we’re querying to find certain features within the human genome. The raw data are called “peptide / spectrum matches”, or “PSMs”. They have sequences, scores, and genomic coordinates, among other things, and we query to find PSMs based on various combinations of these criteria. We store the data in MongoDB, with one document per PSM, and query using Congomongo.

If you want to find all PSMs that have a particular peptide sequence, you’d have a query map like this:

{:peptide.sequence "GLYQRPHDSTRFK"}

If you want to further restrict that to only results that have an expectation value of no greater than 0.01, you’d use this:

{:peptide.sequence "GLYQRPHDSTRFK"
 :scores.e-value {:$lte 0.01}}

Further restricting results to lying within a region of a chromosome would be done like this:

{:peptide.sequence "GLYQRPHDSTRFK"
 :scores.e-value {:$lte 0.01}
 :location.chromosome "X"
 :location.strand "+"
 :location.start {:$gte 12345}
 :location.stop {:$lte 34567}}

Creating the DSL

In reality, there are many more criteria, but by this point a pattern suggests itself: each individual criterion will be a map, and each query will be a simple merging of these maps. Let’s start with the query function, which we’ll use to generate the final query map (it doesn’t actually perform the query).

(defn query [& criteria]
  (apply merge criteria))

That’s it. Now for the rest of the criteria:

(defn matching-peptide [peptide]
  {:peptide.sequence peptide})

(defn with-e-value-cutoff [cutoff]
  {:scores.e-value {:$lte cutoff}})

(defn in-region [{:keys [chromosome strand start stop]}]
  {:location.chromosome chromosome
   :location.strand strand
   :location.start {:$gte start}
   :location.stop {:$lte stop}})

All very straightforward. Now, when we want to create a final query, we write something like this:

(query (matching-peptide "GLYQRPHDSTRFK")
       (with-e-value-cutoff 0.001)
       (in-region {:chromosome "X"
                   :strand "+"
                   :start 12345
                   :stop 34567}))

That’s pretty readable. We’ve gained a lot of flexibility, too, since we’ve decoupled the semantic meaning of a query from the underlying syntactic realities of our data schema and database. We’re free to change how we structure the underlying data (something we’ve already done several times in the course of this project!). For instance, maybe we’ll want to represent a peptide as a plain String instead of a complex object like we have here. We only need to change one line of code for the queries to keep working.
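Here’s a minimal sketch of that one-line change, assuming a hypothetical new schema that stores the peptide as a plain string under :peptide:

(defn matching-peptide [peptide]
  {:peptide peptide})

Every existing call to query keeps working unchanged.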

We can go further, extending our DSL to actually retrieving the results.

(defn find-psms [& criteria]
  (fetch PSM-COLLECTION :where (apply query criteria)))

Here, our application no longer even needs to be aware of which collection we’re searching. The code to retrieve our results is now:

(find-psms (matching-peptide "GLYQRPHDSTRFK")
           (with-e-value-cutoff 0.001)
           (in-region {:chromosome "X"
                       :strand "+"
                       :start 12345
                       :stop 34567}))

That’s almost exactly what the equivalent request would be in plain English. It doesn’t get much simpler than that.

Conclusion

Obviously what I have shown here is pretty basic stuff, and not at all difficult to implement. There’s a lot more that the application will have to do, including paging, limiting, and sorting, as well as more complicated queries. However, little of the remaining functionality is significantly different from what’s been shown. And look what has been gained: an almost-English query language that insulates our application not only from the specific modeling choices we’ve made, but also from the specific database system we are using. This last point is particularly nice in my case, as I plan to migrate from MongoDB to a Neo4j graph database in the near future. Using this DSL internally is going to make that task significantly more straightforward.
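As a taste of that, here’s a sketch of how paging and sorting might be layered on top. The :limit, :skip, and :sort options are standard Congomongo fetch parameters, but the find-psms* name and the options map are my own invention:

;; Like find-psms, but with an options map for paging and sorting.
(defn find-psms* [{:keys [limit skip sort] :or {limit 0 skip 0}} & criteria]
  (fetch PSM-COLLECTION
         :where (apply query criteria)
         :limit limit
         :skip skip
         :sort sort))

A call might then look like (find-psms* {:limit 50 :sort {:scores.e-value 1}} (matching-peptide "GLYQRPHDSTRFK")).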

Update: Aaron Crow mentioned this post in his presentation Clojure on Mongo: Fun and Easy with CongoMongo, presented at Mongo LA on 19 January 2012. Thanks, Aaron!

Writing Elegant Clojure Code Using Higher-Order Functions


Back when I first started writing Clojure code, I heard lots about the use of higher-order functions (also known as HOFs). Since functions are first-class language members in Clojure, you can do things like pass them as arguments or return them from function calls. Any function that accepts or produces another function in this way is a higher-order function. This allows you to write some very powerful and concise code, because you can capture the general form of a computation, while allowing its specific behavior to be determined at runtime by the user. You basically say to the function caller “I’m going to give you X, but I’ll make it however you tell me to”.

So I knew about all this, and could see how powerful a technique it could be, but I didn’t fully grok the whole concept yet. Coming from a mainly Java background at the time, I hadn’t had any experience with first-class functions, and still approached everything from a procedural and object-oriented background. Using HOFs was a bit of an alien concept.

Clojure is littered with HOFs; if you’re new to the language, you’ve already used them, perhaps without realizing it. The map function is probably the archetype of a HOF. It iterates through a sequence of items, applying a function to each one in turn, and returns a sequence of the results. It says “I’m going to transform each element of this sequence, but I’ll do it however you tell me to”. So, you can pass in the inc function to map to increment each number in a list, like so:

user> (map inc [1 2 3 4 5])
(2 3 4 5 6)

You can also use an anonymous function (because hey, it’s still a function), allowing you to, say, multiply each number in a list by five:

user> (map #(* 5 %) [1 2 3 4 5])
(5 10 15 20 25)

All well and good, but this is all still pretty basic. As I said, these kinds of functions are all over Clojure, and you quickly figure out how to use them, out of necessity if nothing else; it’s difficult to do anything in Clojure without them! Soon I realized that it wasn’t the function-accepting HOFs that I hadn’t quite gotten; it was the function-generating HOFs that I didn’t fully appreciate. Clojure has several of these functions, too, and mastering them really allows you to create some elegant constructs. I’ll mainly talk about partial and comp, but there are also juxt and complement, and I may have overlooked others. And of course, you can always make your own.

The simplest of Clojure’s built-in function generators is probably partial, which lets you “prime” an existing function with some number of arguments. For example, you could make a “quintupler” function like we used above, but using partial like this:

user> (def quintuple (partial * 5))
#'user/quintuple
user> (map quintuple [1 2 3 4 5])
(5 10 15 20 25)

This quintuple function is just the standard multiplication function (*), already primed with a first argument of 5 (it is exactly equivalent to (fn [x] (* 5 x))). Any other arguments passed into quintuple will also be multiplied together, and then multiplied by 5. (Though I’ve broken quintuple out as a separate function here, it is more idiomatic to use it directly, like (map (partial * 5) [1 2 3 4 5]).)

The comp function is a little trickier, but not by much. Short for “compose”, it carries out the functional composition you learned about in high school algebra class (you remember f(g(h(x))), right?). So basically, (comp f g h) creates a function that will apply the function h to its arguments, then apply g to the result, then apply f to the result of that. Of course, you can supply as many functions as you like. In this way, it’s similar to (but not the same as!) Clojure’s threading macros (-> and ->>).
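A quick REPL example makes the ordering concrete (comp applies its functions right to left, while -> threads left to right):

user> ((comp inc #(* 2 %)) 5)
11
user> (-> 5 (* 2) inc)
11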

Actually, this whole post is basically an excuse to share the fun trick I recently discovered. Say you need to condense a sequence of pairs into a map. No problem, right? We’ll just use into.

user> (def pairs [[:one 1] [:two 2] [:three 3]])
#'user/pairs
user> (into {} pairs)
{:one 1, :two 2, :three 3}

Now for a wrinkle: what if a key repeats?

user> (def pairs [[:one 1] [:two 2] [:three 3] [:rest 4] [:rest 5] [:rest 6]])
#'user/pairs
user> (into {} pairs)
{:one 1, :two 2, :three 3, :rest 6}

That’s no good; each successive pair with a duplicated key will overwrite the previous values. What you really want is to create a sequence if there are multiple values, but not if there’s only one. HOFs to the rescue!

user> (apply merge-with
             (comp vec flatten vector)
             (map (partial apply hash-map)
                  pairs))
{:rest [4 5 6], :three 3, :two 2, :one 1}

That did it! The magic happens with (comp vec flatten vector). This generates the function that merge-with will use to combine the pairs together (once we turn them into maps, that is). If a key is already present, the function gets called with both the existing value and the value to be added. This can be a bit tricky to grasp at first, so I’ll walk through what’s happening step by step.

Keep in mind that our merge function, (comp vec flatten vector), is only called when there is already a value for a given key. If it’s the first time we’re merging a particular key, there will be one value, but for all subsequent times, there will be a vector of values. We thus have two cases to examine. Since the comp is just composing vec, flatten, and vector, I’ll split out each operation to show what happens.

First, we’ll look at what happens the first time we merge a value. We’ll call the pre-existing value :A and the incoming (to-be-merged) value :B; in the end, we’ll expect to see [:A :B].

user> (vector :A :B)
[:A :B]
user> (flatten [:A :B])
(:A :B)
user> (vec '(:A :B))
[:A :B]
user>

Remember, comp passes the result of one computation as the input to the next in the chain. In this particular scenario, the calls to flatten and vec seem unnecessary; after all, we could have stopped after vector and been done with it. If you’re only ever going to merge two values, then yes, you could have just used vector… but that’s not very interesting, is it? Let’s continue on with our example and merge in an additional value, :C. This time we’re starting with the vector [:A :B], which will illustrate the second case of behavior.

user> (vector [:A :B] :C)
[[:A :B] :C]
user> (flatten [[:A :B] :C])
(:A :B :C)
user> (vec '(:A :B :C))
[:A :B :C]
user>

Now the need for flatten is apparent; if we didn’t use it, we’d end up with an increasingly nested set of vectors within vectors within vectors. By flattening, we eliminate the nesting before it has a chance to start. But flatten gives us a sequence, and we wanted to get a vector back. No problem; vec to the rescue! (Strictly speaking, everything could still work fine without vec, so long as you don’t mind a mixture of vectors and sequences as values in your data structure).

Now we can see that in both cases, we end up with a vector of all the values for a given key being plugged into our growing map. Of course, the astute reader will recognize the (potential) bug lurking here: what if one of your values is already a vector? If that’s the case, this particular implementation will not be very kind to you, since it mercilessly flattens everything in sight. You can get around this, though (and I leave that as an exercise for the reader); in this article I’m focusing on the uses of higher-order functions… that, and the software I wrote this function for never has to deal with vector values, so there :P

So that covers the comp-generated HOF, but there’s another HOF lurking in there, too: (partial apply hash-map). All that does is convert the vector pairs into maps for feeding into merge-with (we have to use apply, because hash-map is not expecting a sequence as input). I told you: HOFs are everywhere in Clojure.
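A quick REPL check of what that does to a single pair:

user> ((partial apply hash-map) [:one 1])
{:one 1}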

Now, contrast this to how I would have written this function when I was young and foolish, pre-HOF:

(apply merge-with
       (fn [vals v]
         (if (vector? vals)
           (conj vals v)
           [vals v]))
       (for [[k v] pairs]
         {k v}))

Quite a bit more verbose and just uglier. Also note that the (comp vec flatten vector) and (partial apply hash-map) functions are more general and re-usable than their wordier counterparts.

This just shows that you can get the job done in Clojure in any number of ways, but to get really succinct and elegant code, it pays to get familiar with Clojure’s function-generating functions.

Exploration of juxt (quite handy for destructuring let bindings) and complement (great for use with filter, remove, and other predicate-consuming functions) are left as exercises for the reader.

Github’s New “Fork and Edit” Feature Is Awesome


Github recently rolled out a new Fork and Edit feature that is pretty awesome. It basically allows you to create your own fork of any Github project, edit files in a new branch of that fork, and create a pull request back to the original project, all from your web browser, and in about as much time as it’s taken me to write this sentence. This really lowers the barrier for contributing code to other projects.

For instance, I was recently working on a Clojure web application at work, generating some HTML with James Reeves’s excellent Hiccup library. I was having some problems with a <canvas> tag being rendered improperly; Hiccup was rendering it like this:

<canvas id='spectrum' />

Apparently, canvases are “container tags”, which need to be explicitly closed with the appropriate tag, like this:

<canvas id='spectrum'></canvas>

The fix in Hiccup is simple enough: add the string “canvas” to a private set of container tags. It is literally adding one string to a data structure. I made a fix in my own Clojure project so I could keep working (thank you, clojure.core/in-ns!) and made a note to go back later and formally submit the change to Hiccup.
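For the curious, the quick-and-dirty local patch looked something like this (a sketch from memory; the container-tags var name reflects Hiccup’s internals at the time, but treat the details as approximate, and my.project.core is a stand-in for my real namespace):

;; Hop into Hiccup's namespace, re-root its private set of container
;; tags to include "canvas", then hop back home.
(in-ns 'hiccup.core)
(def container-tags (conj container-tags "canvas"))
(in-ns 'my.project.core)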

But then I noticed the “Fork and edit this file” button at the top of the page. “What the hell? Let’s give it a shot”, I thought as I pressed it. Instant fork and branch creation, and I’m in an editable view of the code in question. I scroll down, add the magical “canvas” string at the right place, hit commit, and it’s all done. The patch was submitted as a pull request and it all took about 2 minutes from start to finish, a large portion of that time taken up by me thinking “Damn, this is slick!”.

Now, I wouldn’t dream of doing this for any kind of involved coding (Emacs is a far better code editor than an HTML text field, thank you very much), but for minor fixes like this, it’s golden. Think how great this would be for documentation fixes!

And honestly, “Fork and Edit” is really what’s needed for these kinds of fixes. If I’m reading through some code on Github (which I actually kind of prefer, truth be told) and I see some confusing or unclear documentation or some other minor problem, I’m going to have to be really motivated to create a fork, download that code to my computer, fire up Emacs, make the change, push the code back to my fork, and issue a pull request. However, if I can make the change right there while I’m thinking about it, without having to do anything else, then I’m probably going to do it.

Fork and Edit makes contributing to open source code about as easy as editing Wikipedia, and that’s a very good thing.

The Importance of MongoDB Key Names


Coming from a relational database background, you might not devote a lot of thought to the names of your columns. That is to say, while you make an effort to come up with a sensible and descriptive naming scheme for your columns, you probably don’t think about the amount of space those names take up. And why should you? In a relational database, the column names are stored once, probably in a central metadata table. Whether you call a column sequence or seq or s, it isn’t going to make any practical difference in how much space your database uses.

In a schema-free database system like MongoDB, however, the key names you choose can have a sizable impact on the final size of a collection, depending on the size of the keys relative to the whole document, as well as the number of documents in the collection. This is because the key names are stored in each document, a side effect of being schema-free.

Recently I was experimenting with a database at work with 1.6 billion documents (yes, that’s with a B). Each document looks like this:

{"sequence":"AHAHSPGPGSAVKLPAPHSVGKSALR",
 "location":{
     "chromosome":"19",
     "strand":"-",
     "begin":"51067007",
     "end":"51067085"
 }}

Once all the data were loaded up, here’s what the collection stats looked like:

> db.peptides.stats()
{
    "ns" : "haystack.peptides",
    "count" : 1602177119,
    "size" : 216603098452,
    "avgObjSize" : 135.1929795297495,
    "storageSize" : 243349563712,
    "numExtents" : 144,
    "nindexes" : 1,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 66625226448,
    "indexSizes" : {
            "_id_" : 66625226448
    },
    "ok" : 1
}

Total size is around 243 GB.

Then I did a little experiment and used the smallest keys possible; all my keys are now one letter long. Now my documents look like this:

{"s":"AHAHSPGPGSAVKLPAPHSVGKSALR",
 "l":{
     "c":"19",
     "s":"-",
     "b":"51067007",
     "e":"51067085"
 }}

Now, when I reload all the data with this schema, I get these collection stats:

> db.tinypeptides.stats()
{
    "ns" : "haystack.tinypeptides",
    "count" : 1602177119,
    "size" : 155730331732,
    "avgObjSize" : 97.19919844392685,
    "storageSize" : 182977326080,
    "numExtents" : 118,
    "nindexes" : 1,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 66625734352,
    "indexSizes" : {
            "_id_" : 66625734352
    },
    "ok" : 1
}

Final size is now just shy of 183 GB, for a savings of about 60 GB. (That checks out: the average object size dropped by about 38 bytes, and 38 bytes × 1.6 billion documents ≈ 61 GB.) Not too bad!

Of course, you shouldn’t just go changing all your keys for the hell of it. You’ll need to consider how you use your data in applications; who knows, maybe constantly translating between space-saving short keys and human-readable long keys will be too expensive or unwieldy for your particular project. But if it makes sense, using shorter keys can potentially save a lot of space and even increase performance (after all, more of a smaller database can fit into memory).
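If you do decide shorter keys make sense, the translation can live in one small function at the application boundary. Here’s a minimal sketch based on my document shape above (the shorten-keys name and approach are just one illustrative way to do it):

(require '[clojure.set :as set])

;; Rename the human-readable keys to their one-letter storage
;; equivalents before inserting a document.
(defn shorten-keys [psm]
  (-> psm
      (set/rename-keys {:sequence :s, :location :l})
      (update-in [:l] set/rename-keys
                 {:chromosome :c, :strand :s, :begin :b, :end :e})))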

Granted, my documents are very small to begin with, so the amount of space devoted to the keys is relatively large. The proportional space savings for collections with larger documents probably will not be as great; it all depends on your data. In any event, it’s good to be aware of this aspect of schema-free databases.

Not-So-Private Clojure Functions


If you’ve been programming in Clojure for longer than, oh, about 5 minutes, you probably already know how defn creates a publicly accessible function in a namespace, while defn- creates a private one. If you’re outside the original namespace and you try to call a private function, you will get the smackdown.

Here’s a simple demonstration. We’ll create two functions, one public and one private, in the user namespace:

user> (defn hello []
        "Hello World")
#'user/hello
user> (hello)
"Hello World"
user> (defn- secret []
        "TOP SECRET")
#'user/secret
user> (secret)
"TOP SECRET"

If we switch to the other namespace, though, we can only use the public one:

user> (ns other)
nil
other> (user/hello)
"Hello World"
other> (user/secret)

Oops!

var: #'user/secret is not public
  [Thrown class java.lang.IllegalStateException]

However, you can get around the private flag; all you need to do is refer directly to the function’s var:

other> (#'user/secret)
"TOP SECRET"

You can make it a bit easier by creating a var in your new namespace that points to the private one:

other> (def secret #'user/secret)
#'other/secret
other> (secret)
"TOP SECRET"

Now why the hell would you ever want to do this? In general, you probably shouldn’t, at least with other people’s code. Private functions are private for a reason; they’re not part of any public API, so they could disappear or change at a moment’s notice. However, it can come in handy when you’re testing your own code. Often, I’ll have a few private functions that do something useful within a namespace, but really have no business being used anywhere else. Sometimes when I’m testing my public functions, though, I’ll find myself needing these private functions to either set things up, create test data, or otherwise verify that things turned out alright.
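For example, a test namespace might pull in a private helper like this (the namespaces and the build-fixture function here are hypothetical):

(ns my.project.core-test
  (:use clojure.test))

;; Point a local var at the private helper so the tests can call it
;; directly, despite it being marked private in my.project.core.
(def build-fixture #'my.project.core/build-fixture)

(deftest uses-private-helper
  (is (map? (build-fixture))))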

You could also create a separate namespace for all your private helper functions (making them public this time), and then only ever pull that namespace into your main and test namespaces (Fogus and Chouser describe this approach in Section 9.1.2 of The Joy of Clojure; conveniently this chapter is also available as a free download). If you’ve only got a handful of these functions, though, this var shadowing trick is pretty straightforward.