MongoDB and Hadoop … a powerful combination

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB Connector for Hadoop. Once you become familiar with the connector, you can use it to pull your MongoDB data into Hadoop MapReduce jobs, process the data, and write the results back to a MongoDB collection.
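At a high level, a job tells the connector where to read input documents and where to write results through Hadoop configuration properties. As a minimal sketch (the jar name, driver class, host, and collection names are illustrative placeholders, not part of this guide):

  hadoop jar myJob.jar com.example.MyJobDriver \
    -D mongo.input.uri=mongodb://localhost:27017/myDatabase.myCollection \
    -D mongo.output.uri=mongodb://localhost:27017/myDatabase.results

Here mongo.input.uri points the job's input format at the source collection, and mongo.output.uri tells the output format which collection should receive the results.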



To follow this guide, you should already have Hadoop up and running. This can range from a single-node pseudo-distributed Hadoop installation running locally to a deployed cluster containing multiple nodes. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The Hadoop connector supports all Apache Hadoop versions 2.X and up, including distributions based on these versions, such as CDH4, CDH5, and HDP 2.0 and up.


Install and run the latest version of MongoDB. The MongoDB binaries should also be on your system search path (i.e., $PATH).

If your mongod requires authorization [1], the user account used by the Hadoop connector must be able to run the splitVector command on the input database. You can either grant the user the clusterManager role or create a custom role for this purpose:

  db.getSiblingDB("admin").createRole({
    role: "hadoopSplitVector",
    privileges: [{
      resource: {
        db: "myDatabase",
        collection: "myCollection"
      },
      actions: ["splitVector"]
    }],
    roles: []
  })

  db.getSiblingDB("myDatabase").createUser({
    user: "hadoopUser",
    pwd: "secret",
    roles: ["hadoopSplitVector", "readWrite"]
  })

Note that the splitVector command cannot be run through a mongos, so the Hadoop connector automatically connects to the primary shard in order to run the command. If your input collection is unsharded and the connector reads through a mongos, make sure that your MongoDB Hadoop user is also created on the primary shard.
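Creating the user on the primary shard just means connecting to that shard's mongod directly instead of the mongos and running the same createUser command there. A sketch of such a session (the hostname and port are illustrative placeholders; substitute your shard's actual primary):

  mongo --host shard0.example.net --port 27018
  > db.getSiblingDB("myDatabase").createUser({
      user: "hadoopUser",
      pwd: "secret",
      roles: ["hadoopSplitVector", "readWrite"]
    })

Any custom role referenced here must also exist on that shard before the user can be assigned to it.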