NASA and Apache: a match made in heaven!

January 22nd, 2010

It's official. NASA has its first sorta official Apache project!

At this stage, Apache's official policy is to not do official PR announcing the project (we are kind of at step 2 in the process, out of 3). So what could be less official than my crappy blog that no one reads? I should be able to go on about it here for ages then and still be within the policy! ;)

Anyhoo, we have a ways to go before graduation, having just had a formal vote and been accepted into the Incubator, but it couldn't hurt for me to dream, right?

Spatial SOLR

December 31st, 2009

So, lately, I've been pretty involved in Spatial SOLR. I'm just scratching the surface of trying to understand this stuff. Mat Brown recently posted an interesting plugin, though, that looks really clean and easy to understand. I'm going to give it a go and see how it performs on my oceans dataset.

Apache Tika 0.5 out the door

December 31st, 2009

I announced a while back that Tika 0.5 is available for downloading. Get it while it's hot. Notable changes include moving to a source-only release this time, improved RDF and OWL parsing and detection, and other speedups.

To Database or Map/Reduce It?

August 29th, 2009

More and more of the meetings I've attended recently, including some at major US funding institutions, have included debates between two factions of people:

  • Those that would group themselves as members of the database community.
  • Those that would group themselves as members of the Hadoop/MapReduce community.

Now, debates are healthy, and I'm all for them, but I can't help but think that this is an apples and oranges waste of time. At work, I regularly deal with the management of large scale science data, and its dissemination. My dissertation research investigated a framework for predicting (based on posterior analysis of successful decisions) the appropriate data movement technology (or set of technologies) to apply to these problems. What I'm hearing regularly at these meetings regarding large scale data management, though, is this fundamental(ly wrong) question:

Should we use a traditional DBMS solution or should we use Map Reduce to query and manage our large data?

The reasons behind this boil down to what I feel is really a misunderstanding on the part of the database community -- Hadoop and the Map Reduce paradigm of programming made famous by Google's paper are not necessarily replacements for database management systems. Nor should it be the goal for folks to merge the two, building databases that execute their queries using Map Reduce over Hadoop, or a Map Reduce framework that executes using a DBMS query executor. Neither of these really makes sense, because these technologies are complementary if anything, and more likely each is a specific solution to the problems where it makes sense. Let's take a real example of this:

I was involved in the NASA Orbiting Carbon Observatory mission, the one that unfortunately fell out of the sky in February 2009. The Level 2 algorithm, which produced the output science products that would ultimately be distributed to the community for analysis, looked very much like a Map Reduce type problem. It involved a chunker that split a Level 1B data file into small pieces, each handed to one of many copies of the Level 2 algorithm executing in parallel. The results of this distributed execution were then to be aggregated and disseminated as a single Level 2 product file. That said, this process involved many elements of data management. For example, the Level 2 algorithm required that we stage the appropriate Level 1B file to a cluster node for use in processing. Additionally, the intermediate outputs generated by the Level 2 algorithm needed to be cataloged so that they could be looked up later by the Level 2 aggregator.

In all, this was a problem that really had elements of both a DBMS type system and the Map Reduce paradigm of execution. This is a case where you can envision something like Hadoop, as well as a DBMS (for file cataloging), coming into play. So, why does it have to be one or the other? Why can't it be both?
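To make the "why not both" point concrete, here is a toy sketch of that shape of pipeline. Everything in it is an assumption for illustration: the names `chunk`, `level2` and `run_pipeline` are made up, the "Level 1B file" is just a list of numbers, the "Level 2 algorithm" is a simple average, a thread pool stands in for cluster nodes, and an in-memory SQLite database stands in for the file catalog. It is not the actual OCO processing code, just the map step, the catalog, and the aggregation step side by side.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def chunk(samples, size):
    """Chunker: split one Level 1B 'file' (here just a list) into small pieces."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def level2(piece):
    """Hypothetical stand-in for the Level 2 algorithm: average one piece."""
    return sum(piece) / len(piece)


def run_pipeline(samples, size=4, workers=4):
    # The DBMS half: a catalog of intermediate outputs (in-memory for the sketch).
    catalog = sqlite3.connect(":memory:")
    catalog.execute(
        "CREATE TABLE intermediate "
        "(chunk_id INTEGER PRIMARY KEY, n INTEGER, value REAL)"
    )

    pieces = chunk(samples, size)

    # The map half: many copies of the Level 2 algorithm running in parallel.
    # Threads stand in for cluster nodes here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(level2, pieces))

    # Catalog each intermediate output so the aggregator can look it up later.
    catalog.executemany(
        "INSERT INTO intermediate (chunk_id, n, value) VALUES (?, ?, ?)",
        [(i, len(p), r) for i, (p, r) in enumerate(zip(pieces, results))],
    )
    catalog.commit()

    # The reduce half: the aggregator finds its inputs with a declarative
    # query against the catalog, then combines them into a single product.
    rows = catalog.execute(
        "SELECT n, value FROM intermediate ORDER BY chunk_id"
    ).fetchall()
    catalog.close()

    total = sum(n for n, _ in rows)
    return sum(n * value for n, value in rows) / total
```

Calling `run_pipeline(list(range(10)), size=4)` splits the input into three pieces, processes them in parallel, catalogs the three intermediate values, and aggregates them back into one weighted average. Neither half replaces the other: the parallel step never touches the catalog, and the catalog never runs the algorithm.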

In my opinion, the use of databases for science data really boils down to a few fundamental concerns:

  • Distribution of the data -- unfortunately, unlike with traditional DBMSes, the data can't just be dumped into a single database. There are ownership and funding issues, and requirements that the data be co-located with its science expertise.
  • Access patterns -- how do the users want to obtain the information? Do they want to obtain it using a declarative SQL type pattern? Even if this is true, this no longer mandates the use of a database, thanks to a recent Apache project or two.
  • Organization of science data -- science data is organized as file granules that encapsulate a particular series of observations over space and time into a data unit. It is not organized in the relational model, as tables with rows and columns.

So, for me, while I think databases certainly have a purpose in ACID type operations, and in environments where there are lots of transactions and where file access patterns are ad-hoc (read/write and random access with append/replace anywhere), I think in the science realm life is different. Files are typically append-only, and frequently read but seldom rewritten.

What's more, the sheer momentum with which technologies like Hadoop and paradigms like Map Reduce have gained world-wide adoption and a community of support leads me to believe that there is something more there than smoke, and something that certainly could benefit the scientific community in a more direct and immediate way.

To wrap it up though -- can we agree that the Map Reduce paradigm and Hadoop are really different beasts than DBMSes? They are not equivalent, people!

Apache Tika 0.4 released - get it while it's hot

August 16th, 2009

I had the privilege of being the release manager for yet another Apache Tika release - 0.4 is out the door and has a number of major improvements over prior releases, including a major refactoring that separates Tika's core components from its parsers and from its external interfaces.

You can read the release announcement here.

You can grab Tika here.

WWW2009 in Madrid, Spain

April 24th, 2009

Hanging out in Spain for a week was a blast. Visited the Royal Palace, several museums (including the Prado), the Basilica church, and took a day trip to Toledo.

At this point, just looking forward to coming home!

Apache Tika 0.3 out the door

March 20th, 2009

Apache Tika, a sub-project of Apache Lucene, and a toolkit for content analysis and detection, has just made its 0.3 release.

You can grab the release from a nearby mirror here.

Apache Tika 0.2 released!

December 10th, 2008

Apache Tika 0.2 has recently been released! Thanks to Dave Meike for leading the charge.

You can grab Tika 0.2 here. Of note is that Tika recently graduated out of the Incubator and is now a full-fledged sub-project of Apache Lucene.

w00t!

Last Degree EVAR

September 25th, 2007

Just received this today:

Kind of weird to be at the end of an era, but also really, really liberating, and it feels good as well.

Vegas Baby, Vegas

August 27th, 2007

Just recently got back from IEEE IRI in Las Vegas, Nevada. You can see a pic of me watching an Asian lady give her intro talk to my session.

Pretty cool, huh? Wife and I went up and stayed at the Las Vegas Hilton, the conference hotel. Mediocre at best, but still, a nice little vacation.