To Database or Map/Reduce It?
August 29, 2009
More and more of the meetings I've attended recently, including some at major US funding institutions, have included debates between two factions of people:
- Those that would group themselves as members of the database community.
- Those that would group themselves as members of the Hadoop/MapReduce community.
Now, debates are healthy, and I'm all for them, but I can't help but think that this is an apples-and-oranges waste of time. At work, I regularly deal with the management of large-scale science data and its dissemination. My dissertation research investigated a framework for predicting (based on posterior analysis of successful decisions) the appropriate data movement technology (or set of technologies) to apply to these problems. What I'm hearing regularly at these meetings regarding large-scale data management, though, is this fundamental(ly wrong) question:
Should we use a traditional DBMS solution or should we use Map Reduce to query and manage our large data?
The reasons behind this boil down to what I feel is really a misunderstanding on the part of the database community: Hadoop and the Map Reduce paradigm of programming made famous by Google's paper are not necessarily replacements for database management systems. Nor should it be the goal for folks to synergistically merge the two, and build databases that execute their queries using Map Reduce over Hadoop, or build a Map Reduce framework that executes using a DBMS query executor. Neither of these really makes sense, because these technologies are complementary if anything, and more likely each is a specific solution to the problems where it makes sense. Let's take a real example of this:
I was involved in the NASA Orbiting Carbon Observatory mission, the one that unfortunately fell out of the sky in February 2009. The Level 2 algorithm, which produced the output science products that would ultimately be distributed to the community for analysis, looked very much like a Map Reduce type problem. It involved a chunker that split a Level 1B data file into small pieces, each handed to one of many copies of the Level 2 algorithm executing in parallel. The results of this distributed execution were then to be aggregated and disseminated as a single Level 2 product file. That said, this process also involved many elements of data management. For example, the Level 2 algorithm required that we stage the appropriate Level 1B file to a cluster node for use in processing. Additionally, the intermediate outputs generated by the Level 2 algorithm needed to be cataloged so that they could be looked up later by the Level 2 aggregator.
In all, this was a problem that really had elements of both a DBMS-type system and the Map Reduce paradigm of execution. This is a solution where you can envision something like Hadoop, as well as a DBMS (for file cataloging), coming into play. So, why does it have to be one or the other? Why can't it be both?
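To make the "why not both" point concrete, here is a minimal sketch of that shape of pipeline. It is not the mission's actual software: the function names (`chunk`, `level2`, `run_pipeline`), the table layout, and the toy "algorithm" (a per-chunk sum and count) are all made up for illustration. A thread pool stands in for the cluster nodes, and an in-memory SQLite database stands in for the DBMS that catalogs intermediate outputs.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def chunk(records, size):
    """Chunker: split a Level 1B-like list of values into fixed-size pieces."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def level2(piece):
    """Hypothetical stand-in for the Level 2 algorithm (the 'map' step)."""
    return {"total": sum(piece), "count": len(piece)}


def run_pipeline(records, chunk_size=4, workers=3):
    # The catalog is the DBMS-style part: it records every intermediate output.
    catalog = sqlite3.connect(":memory:")
    catalog.execute(
        "CREATE TABLE intermediate ("
        "chunk_id INTEGER PRIMARY KEY, total REAL, count INTEGER)"
    )

    # Map: run many copies of the algorithm in parallel, one per chunk.
    pieces = chunk(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(level2, pieces))

    # Catalog the intermediate outputs so the aggregator can look them up.
    catalog.executemany(
        "INSERT INTO intermediate VALUES (?, ?, ?)",
        [(i, r["total"], r["count"]) for i, r in enumerate(results)],
    )
    catalog.commit()

    # Reduce: the aggregator queries the catalog and builds a single product.
    total, count = catalog.execute(
        "SELECT SUM(total), SUM(count) FROM intermediate"
    ).fetchone()
    catalog.close()
    return total / count


if __name__ == "__main__":
    print(run_pipeline(list(range(10))))
```

The parallel map and the final aggregation are the Map Reduce part, while the lookup of intermediate results is an ordinary declarative query. Neither piece replaces the other.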
In my opinion, the use of databases for science data really boils down to a few fundamental concerns:
- Distribution of the data -- unfortunately, unlike with a traditional DBMS, the data can't just be dumped into a single database. There are ownership and funding issues, and requirements that the data be co-located with its science expertise.
- Access patterns -- how do the users want to obtain the information? Do they want to obtain it using a declarative SQL type pattern? Even if this is true, this no longer mandates the use of a database, thanks to a recent Apache project or two.
- Organization of science data -- science data is organized as file granules that encapsulate a particular series of observations over space and time into a data unit. It is not organized in the relational model, as tables with rows and columns.
So, for me, while I think databases certainly have a purpose in ACID-type operations, and in environments where there are lots of transactions and file access patterns are ad hoc (read/write and random access with append/replace anywhere), I think in the science realm life is different. Files are typically append-only, and are frequently read but rarely written.
What's more, the sheer speed with which technologies like Hadoop and paradigms like Map Reduce have fostered world-wide adoption and a community of support leads me to believe that there is something more there than smoke, and something that certainly could benefit the scientific community in a more direct and immediate way.
To wrap it up, though: can we agree that the Map Reduce paradigm and Hadoop are really different beasts than DBMSs? They are not equivalent, people!