Getting Started With MongoDB: Part 2

Ready to continue learning about MongoDB, one of the coolest technologies for web developers? In this second part of the series, we move on from the basics, on to advanced queries - with conditional operators - and MapReduce.

Stepping Beyond the Basics

Earlier, we covered the basics of and how to get started with mongoDB, one of the absolute best of breed of the NoSQL implementations. We looked at how to install it, create a basic database and then perform the basic operations on it:

search
update
delete

In addition to this, we started to look at how to start interacting with mongoDB in a more powerful way, through the use of selectors. Selectors give us the ability to have much more fine grained control and to dig pretty deeply in to find the data that we really want.

Now that's all well and good to get started with, but when you want to write a real application, you need to go a lot further. Well, this is still a getting started series after all, but I want to get you excited about the possibilities of working in a document-oriented way. I want to enthuse you to take this great technology and make it your own and use it as powerfully as you can to make fantastic applications.

Today, we're going to expand on queries from last time and learn two key aspects of mongoDB:

Advanced Queries
MapReduce

Advanced Queries

Previously we looked at basic queries and were introduced to selectors. Now we're going to get into more advanced queries, by building on the previous work in two key ways:

Conditional Operators
Regular Expressions

Each of these successively provide us with more fine-grained control over the queries we can write and, consequently, the information that we can extract from our mongoDB databases.

Conditional Operators

Conditional operators are, as the name implies, operators to collection queries that refine the conditions that the query must match when extracting data from the database. There are a number of them, but today I'm going to focus on 9 key ones. These are:

$lt – value must be less than the conditional
$gt – value must be greater than the conditional
$lte – value must be less than or equal to the conditional
$gte – value must be greater than or equal to the conditional
$in – value must be in a set of conditionals
$nin – value must NOT be in a set of conditionals
$not – value must be equal to a conditional

Let's look at each one in turn. Open up your terminal and get ready to use the original database from the first part in this series (pre-modifications). To make this tutorial easier, we're going to make a slight alteration to the database. We're going to give each document in our collection an age attribute. To do that, run the following modification query:

			db.nettuts.update({"_id" : ObjectId("4ef224be0fec2806da6e9b27")}, {"$set" : {"age" : 18 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b28")}, {"$set" : {"age" : 45 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b29")}, {"$set" : {"age" : 65 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b2a")}, {"$set" : {"age" : 43 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b2b")}, {"$set" : {"age" : 22 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b2c")}, {"$set" : {"age" : 45 }});
			db.nettuts.update({"_id" : ObjectId("4ef224bf0fec2806da6e9b2d")}, {"$set" : {"age" : 33 }});

All being well, you can run a 'find all' and you'll have the following output:

		db.nettuts.find();                                                                          
		{ "_id" : ObjectId("4ef224be0fec2806da6e9b27"), "age" : 18, "dob" : "21/04/1978", "first" : "matthew", "gender" : "m", "hair_colour" : "brown", "last" : "setter", "nationality" : "australian", "occupation" : "developer" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b28"), "age" : 45, "dob" : "26/03/1940", "first" : "james", "gender" : "m", "hair_colour" : "brown", "last" : "caan", "nationality" : "american", "occupation" : "actor" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b29"), "age" : 65, "dob" : "03/06/1925", "first" : "arnold", "gender" : "m", "hair_colour" : "brown", "last" : "schwarzenegger", "nationality" : "american", "occupation" : "actor" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2a"), "age" : 43, "dob" : "21/04/1978", "first" : "tony", "gender" : "m", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : "developer" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2b"), "age" : 22, "dob" : "22/11/1958", "first" : "jamie lee", "gender" : "f", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : "actor" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2c"), "age" : 45, "dob" : "14/03/1933", "first" : "michael", "gender" : "m", "hair_colour" : "brown", "last" : "caine", "nationality" : "english", "occupation" : "actor" }
		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2d"), "age" : 33, "dob" : "09/12/1934", "first" : "judi", "gender" : "f", "hair_colour" : "white", "last" : "dench", "nationality" : "english", "occupation" : "actress" }

$lt/$lte

Now let's find all the actors who are less than 40. To do that, run the following query:

1	db.nettuts.find( { "age" : { "$lt" : 40 } } );

After running that query, you'll see the following output:

		{ "_id" : ObjectId("4ef224be0fec2806da6e9b27"), "age" : 18, "dob" : "21/04/1978", "first" : "matthew", "gender" : "m", "hair_colour" : "brown", "last" : "setter", "nationality" : "australian", "occupation" : "developer" }

		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2b"), "age" : 22, "dob" : "22/11/1958", "first" : "jamie lee", "gender" : "f", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : "actor" }

		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2d"), "age" : 33, "dob" : "09/12/1934", "first" : "judi", "gender" : "f", "hair_colour" : "white", "last" : "dench", "nationality" : "english", "occupation" : "actress" }

What about the ones who are less than 40 inclusive? Run the following query to return that result:

1	db.nettuts.find( { "age" : { "$lte" : 40 } } );

This returns the following list:

		{ "_id" : ObjectId("4ef224be0fec2806da6e9b27"), "age" : 18, "dob" : "21/04/1978", "first" : "matthew", "gender" : "m", "hair_colour" : "brown", "last" : "setter", "nationality" : "australian", "occupation" : "developer" }

		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2b"), "age" : 22, "dob" : "22/11/1958", "first" : "jamie lee", "gender" : "f", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : "actor" }

		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2d"), "age" : 33, "dob" : "09/12/1934", "first" : "judi", "gender" : "f", "hair_colour" : "white", "last" : "dench", "nationality" : "english", "occupation" : "actress" }

$gt/$gte

Now let's find all the actors who are older than 47. Run the following query to find that list:

1	db.nettuts.find( { 'age' : { '$gt' : 47 } } );

You'll then get the following output:

		{ "_id" : ObjectId("4ef224bf0fec2806da6e9b29"), "age" : 65, "dob" : "03/06/1925", "first" : "arnold", "gender" : "m", "hair_colour" : "brown", "last" : "schwarzenegger", "nationality" : "american", "occupation" : "actor" }

What about inclusive of 40?

1	db.nettuts.find( { 'age' : { '$gte' : 47 } } );

As there's only one person over 47, the data returned doesn't change.

$in/$nin

What about finding information based on a list of criteria? These first ones have been ok, but arguably, quite trivial. Let's now look to see which of the people we have are either actors or developers. With the following query, we'll find that out (to make it a bit easier to read, we've limited the keys that are returned to just first and last names):

1	db.nettuts.find( { 'occupation' : { '$in' : [ "actor", "developer" ] } }, { "first" : 1, "last" : 1 } );

This query, yields the following output:

			{ "_id" : ObjectId("4ef224be0fec2806da6e9b27"), "first" : "matthew", "last" : "setter" }
			{ "_id" : ObjectId("4ef224bf0fec2806da6e9b28"), "first" : "james", "last" : "caan" }
			{ "_id" : ObjectId("4ef224bf0fec2806da6e9b29"), "first" : "arnold", "last" : "schwarzenegger" }
			{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2a"), "first" : "tony", "last" : "curtis" }
			{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2b"), "first" : "jamie lee", "last" : "curtis" }
			{ "_id" : ObjectId("4ef224bf0fec2806da6e9b2c"), "first" : "michael", "last" : "caine" }

You can see that we can get the inverse of this by using $ninjust as simply.

Let's make this a bit more fun and combine some of the operators. Let's say that we want to look for all the people, who are either male or developers, they're less than 40 years of age.

Now that's a bit of a mouthful, but with the operators that we've used so far – quite readily achievable. Let's work through it and you'll see. Have a look at the query below:

1	db.nettuts.find( { $or : [ { "gender" : "m", "occupation" : "developer" } ], "age" : { "$gt" : 40 } }, { "first" : 1, "last" : 1, "occupation" : 1, "dob" : 1 } );

You can see that we've stipulated that the either the gender can be male or the occupation can be a developer in the $or condition and then added an and condition of the age being greater than 4.

For that, we get the following results:

			{ "_id" : ObjectId("4ef22e522893ba6797bf8cb6"), "first" : "matthew", "last" : "setter", "dob" : "21/04/1978", "occupation" : "developer" }

			{ "_id" : ObjectId("4ef22e522893ba6797bf8cb9"), "first" : "tony", "last" : "curtis", "dob" : "21/04/1978", "occupation" : "developer" }

Regular Expressions

Now I'm sure that you're not going to be satisfied with just this. I did promise you some more complexity and advanced functionality. So let's get in to using some regular expressions. Let's say that we want to find the users that have a first name starting with 'ma' or 'to' and who's last names begin with 'se' or 'de'. How would we do that?

Have a look at the following query using a regular expression:

1	db.nettuts.find( { "first" : /(ma\|to)*/i, "last" : /(se\|de)/i } );

Given that, the results will be:

		{ "_id" : ObjectId("4ef22e522893ba6797bf8cb6"), "first" : "matthew", "last" : "setter", "dob" : "21/04/1978", "gender" : "m", "hair_colour" : "brown", "occupation" : "developer", "nationality" : "australian" }

		{ "_id" : ObjectId("4ef22e532893ba6797bf8cbc"), "first" : "judi", "last" : "dench", "dob" : "09/12/1934", "gender" : "f", "hair_colour" : "white", "occupation" : "actress", "nationality" : "english" }

Let's look at that query a bit more closely. Firstly, we're performing a regex on the first name.

1	"first" : /(ma\|to)*/i

//i indicates that we're performing a case-insensitive regex.

(ma|to)* indicates that the start of the first name string must be either 'ma' or 'to'.

If you're not familiar, the * at the end, will match anything after that. So when you put it together, we match first names that have either 'ma' or 'to' at the beginning of them. In the regex for the last name, you can see that we've done the same thing, but for the last name.

Not quite sure? Let's try another one. What about combining it with one of the conditional operators. Let's say we want to find all the people with the first name of james or jamie who are american female actors. How would we do that? Well, let's see how we'd do it below:

1	db.nettuts.find( { "first" : /(jam?e)/i, "gender" : "f", "occupation" : "actor", "nationality" : "american" } );

The regex above will match combinations such as: james, jamie, jamee etc. The question mark will match one character, whether a-z, A-Z or 0-9. Then, as before, the * matches anything else that comes after the 'e'. From there on, we're using the conditional operators from before to further limit the results that come back. It should be noted that as we're using the case-insensitive operator, I, the queries won't use an index. But for the purposes of this example, it's fine.

The output of the query above is:

		{ "_id" : ObjectId("4ef22e522893ba6797bf8cba"), "first" : "jamie lee", "last" : "curtis", "dob" : "22/11/1958", "gender" : "f", "hair_colour" : "brown", "occupation" : "actor", "nationality" : "american" }

MapReduce

MapReduce is the big daddy of data analysis. In case you've not heard of it, MapReduce is a process where the aggregation of data can be split up and farmed out across a cluster of computers to reduce the time that it takes to determine an aggregate result on a set of data.

It's made up of two parts: Map and Reduce. Map creates the jobs that can then be farmed out to the worker nodes to run the Reduce component. Reduce then computes the answer for that chunk of work that was farmed out to it and returns the result that can be combined with the other chunks to form the final answer.

If you want a more specific description, here's what Wikipedia has to say about it:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Example MapReduce

Let's look at a simple example. We're going to analyse our simple dataset and find the total count of all the females in the group. Admittedly, this is a very simplistic example, but it will lay the foundation for understanding, practically, how MapReduce works.

The Map Function

Here, we're going to create a map function that aggregates the information in our dataset by the gender of the person and emits a count of 1 for every one of them.

1	var map = function() {
2	emit( { gender: this.gender }, { count: 1 } );
3	}

This will return output similar to the following:

1	{ 'f' : 1 }

The Reduce Function

Our reduce function is going to take the output from the map function and use it to keep a running total of the count for each gender. Have a look at the reduce function below.

			var reduce = function(key, values) {
			    var result = { count : 0 };
    
			    values.forEach(function(value){
			        result.count += value.count;
			    })
    
			    return result;
			}

Running the MapReduce

Now, we put them together by calling the mapReduce function in our current database. We pass in the map and reduce variables we created previously by calling our map and reduce functions and specify the name of the collection that the result will be stored in; in this case 'gender'. Just to reiterate, the result of calling the mapReduce function is a collection in our current database; that we can iterate over just like we would any other collection. Have a look at the code below:

1	var res = db.nettuts.mapReduce( map, reduce, { out : 'gender' } );

Displaying the output

When the Map-Reduce is completed, we can access it just like a normal collection, by running the findfunction on it as we do below.

1	db.gender.find();
2	{ "_id" : { "gender" : "f" }, "value" : { "count" : 2 } }
3	{ "_id" : { "gender" : "m" }, "value" : { "count" : 5 } }

Here, we now have a running total per gender; 2 for females and 5 for males - which correlates with our dataset. But what if we wanted to filter by females in the group. Well, not much to it. We only need to make use of the query clause to allow us to do so. Lucky for us, it will look familiar. Have a look at the query below.

1	var res = db.nettuts.mapReduce( map, reduce, { out : 'gender', query : { "gender" : "f" } } );

Now, when we display the output, it will look like that below:

1	db.gender.find();
2	{ "_id" : { "gender" : "f" }, "value" : { "count" : 2 } }

There are a number of other parameters that you can pass to the mapReduce function to further customise the output.

sort - sort the output returned
limit - limit the number of results returned
out - the name of the collection to store the results in
finalise - specify a function to run after the reduction process is complete
scope - specify variables that can be used in the map, reduce and finalise function scope
jsMode - Avoids an intermediate step (between map and reduce) of converting back to JSON format
verbose - track statistics about the execution process

Winding Up

This has been a whirlwind coverage of some of the more complex topics of mongoDB. But I hope that it's given you even more of a taste of what's possible through using this great tool.

We've looked at the conditional operators: $lt, $gt, $lte, $gte, $in, $nin, $not and run through an introduction to MapReduce. I hope that you've gotten a lot out of this and will learn more about the great tool that is mongoDB.