IEEE Senior Membership

August 28th, 2010

So it's been a while since I last blogged, but I figured there was something noteworthy to talk about today, so here I am.

I found out yesterday that I was elevated to IEEE Senior Membership, something that I had applied for a few months ago after having felt that I was ready and had demonstrated the qualities they are looking for (pasted here for your convenience, see here for full list):

  • a candidate shall be an engineer, scientist, educator, technical executive or originator in IEEE-designated fields.
  • candidates shall have been in professional practice for at least ten years.
  • candidates shall have shown significant performance over a period of at least five of those years.

However, anyone who knows me knows that I don't take anything for granted, nor do I feel anything like this is "in the bag" -- I constantly worry until I find out one way or another. So, it was to my utter surprise that I found out yesterday: I made it! I was made an IEEE Senior Member! Wow.

Check out these sweet benefits:

  • Recognition: The professional recognition of your peers for technical and professional excellence.
  • Senior Member Plaque: Since January 1999, all newly elevated senior members have received an engraved Senior Member plaque to be proudly displayed for colleagues, clients and employers to see. The plaque, an attractive fine wood with bronze engraving, is sent within six to eight weeks after elevation.
  • US$25 Coupon: IEEE will recognize all newly elevated senior members with a coupon worth up to US$25. This coupon can be used to join one new IEEE society. The coupon expires on 31 December of the year in which it is received.
  • Letter of Commendation: A letter of commendation will be sent to your employer on the achievement of senior member grade (upon the request of the newly elected senior member).
  • Announcements: Announcement of elevation can be made in section/society and/or local newsletters, newspapers and notices.
  • Leadership Eligibility: Senior members are eligible to hold executive IEEE volunteer positions.
  • Ability to Refer Other Candidates: Senior members can serve as a reference for other applicants for senior membership.
  • Review Panel: Senior members are invited to be on the panel to review senior member applications.

Thanks much to Dr. Barry Boehm, Dr. Ellis Horowitz, and Dr. Ian Gorton for agreeing to serve as my references for the nomination!

Also, I just wanted to throw a shout out to Dr. James Marshall, who was also elevated yesterday. Great job, Jim!

Apache Tika 0.7 and Tika TLP

April 4th, 2010

I just cut the Apache Tika 0.7 release.

You can go grab it from one of the Apache download mirrors.

Of note, per the recent discussions on the mailing list, Tika is now going to be an Apache Top-Level Project (TLP), so this will be our last Apache Lucene-based release. Thanks for the support, Lucene community!

Enjoy!

Membership in the Apache Software Foundation

April 4th, 2010

Wow. So, I received an email the other day asking me if I accepted membership within the Apache Software Foundation (ASF). For a long time, I've been a participant, as a committer, as someone who was the progenitor of one of Apache's projects (Tika), as a PMC member (within Apache Lucene), but I honestly was taken aback that I had been nominated and accepted into the ASF as a member.

I sincerely appreciate the goodwill, Apache members, and will do my best to live up to the great precedent set by the other ASF members. Of note: it seems I'm the first member from the National Aeronautics and Space Administration (NASA), and the first member to represent the Jet Propulsion Laboratory.

NASA and Apache: a match made in heaven!

January 22nd, 2010

It's official. NASA has its first sorta official Apache project!

At this stage, Apache's official policy is to not do official PR announcing the project (we are kind of at step 2 in the process, out of 3). So what could be less official than my crappy blog that no one reads! I should be able to lord on here for ages about it then and still be within the policy! ;)

Anyhoo, we have a ways to go before graduation, having only just had a formal vote and been accepted into the Incubator, but it couldn't hurt for me to dream, right?

Spatial SOLR

December 31st, 2009

So, lately, I've been pretty involved in Spatial SOLR. I'm just scratching the surface of trying to understand this stuff. There was an interesting plugin posted recently by Mat Brown, though, that looks really clean and easy to understand. I'm going to give it a go and see how it turns out on my oceans dataset.

Apache Tika 0.5 out the door

December 31st, 2009

I announced a while back that Tika 0.5 is available for downloading. Get it while it's hot. Notable changes include moving to a source-only release this time, improved RDF and OWL parsing and detection, and other speedups.

To Database or Map/Reduce It?

August 29th, 2009

More and more of the meetings I've attended recently, including some at major US funding institutions, have included debates between two factions of people:

  • Those that would group themselves as members of the database community.
  • Those that would group themselves as members of the Hadoop/MapReduce community.

Now, debates are healthy, and I'm all for them, but I can't help but think that this is an apples and oranges waste of time. At work, I regularly deal with the management of large scale science data, and its dissemination. My dissertation research investigated a framework for predicting (based on posterior analysis of successful decisions) the appropriate data movement technology (or set of technologies) to apply to these problems. What I'm regularly hearing at these meetings regarding large scale data management, though, is this fundamental(ly wrong) question:

Should we use a traditional DBMS solution or should we use Map Reduce to query and manage our large data?

The reasons behind this boil down to what I feel is really a misunderstanding on the part of the database community -- Hadoop and the Map Reduce paradigm of programming made famous by Google's paper are not necessarily replacements for database management systems. Nor should it be the goal for folks to synergistically merge the two, and build databases that execute their queries using Map Reduce over Hadoop, or to build a Map Reduce framework that executes using a DBMS query executor. Neither of these really makes sense, because these technologies are complementary if anything, but more likely specific solutions to the problems where they make sense. Let's take a real example of this:

I was involved in the NASA Orbiting Carbon Observatory mission, the one that unfortunately fell out of the sky in February 2009. The Level 2 algorithm, which produced the output science products that would ultimately be distributed to the community for analysis, looked very much like a Map Reduce type problem. It involved a chunker that split a Level 1B data file into small pieces to be handed to many copies of the Level 2 algorithm executing in parallel; the results of this distributed execution were then to be aggregated and disseminated as a single Level 2 product file. That said, this process also involved many elements of data management. For example, the Level 2 algorithm required that we stage the appropriate Level 1B file to a cluster node for use in processing. Additionally, the intermediate outputs generated by the Level 2 algorithm needed to be cataloged so that they could be looked up later for aggregation by the Level 2 aggregator.

In all, this was a problem that really had elements of both a DBMS type system and the Map Reduce paradigm of execution. This is a problem where you can envision something like Hadoop, as well as a DBMS (for file cataloging), coming into play. So, why does it have to be one or the other? Why can't it be both?
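To make the "both" idea a little more concrete, here is a minimal Python sketch of that shape of pipeline. It is a toy under stated assumptions, not the actual OCO code: the names `chunk`, `level2`, and `run_pipeline` are hypothetical, the "Level 2 algorithm" is just a mean over each piece, threads stand in for cluster nodes, and an in-memory SQLite table stands in for the file catalog.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def chunk(records, size):
    """The 'chunker': split one Level 1B input into small pieces."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def level2(piece):
    """Hypothetical stand-in for the Level 2 algorithm (the 'map' step)."""
    return sum(piece) / len(piece)


def run_pipeline(l1b_records, chunk_size=4, workers=3):
    # Assumption: an in-memory SQLite table plays the role of the file catalog.
    catalog = sqlite3.connect(":memory:")
    catalog.execute(
        "CREATE TABLE intermediate "
        "(chunk_id INTEGER PRIMARY KEY, n INTEGER, mean REAL)"
    )

    pieces = chunk(l1b_records, chunk_size)

    # Threads stand in for cluster nodes running copies of the algorithm.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(level2, pieces))

    # Data management side: catalog every intermediate output.
    catalog.executemany(
        "INSERT INTO intermediate VALUES (?, ?, ?)",
        [(i, len(p), r) for i, (p, r) in enumerate(zip(pieces, results))],
    )

    # The aggregator (the 'reduce' step) looks its inputs up in the catalog.
    rows = catalog.execute(
        "SELECT n, mean FROM intermediate ORDER BY chunk_id"
    ).fetchall()
    catalog.close()

    # Combine per-chunk means into one product, weighted by chunk size.
    total = sum(n for n, _ in rows)
    return sum(n * mean for n, mean in rows) / total


if __name__ == "__main__":
    print(run_pipeline(list(range(10))))
```

The point of the sketch is only that the parallel map/aggregate step and the catalog lookup sit comfortably side by side; swapping the thread pool for something like Hadoop and SQLite for a real catalog changes the scale, not the shape.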

In my opinion, the use of databases for science data really boils down to a few fundamental concerns:

  • Distribution of the data -- unfortunately, as opposed to traditional DBMS'es, the data can't just be dumped into a single database. There are ownership and funding issues, and requirements that the data be co-located with its science expertise.
  • Access patterns -- how do the users want to obtain the information? Do they want to obtain it using a declarative SQL type pattern? Even if this is true, this no longer mandates the use of a database, thanks to a recent Apache project or two.
  • Organization of science data -- science data is organized as file granules that encapsulate a particular series of observations over space and time into a data unit. It is not organized in the relational model, as tables with rows and columns.

So, for me, while I think databases certainly have a purpose in ACID type operations, and in environments where there are lots of transactions and access patterns are ad hoc (read/write and random access with append/replace anywhere), I think life is different in the science realm. Files are typically append-only, and are frequently read but not modified.

What's more, the sheer pace at which technologies like Hadoop and paradigms like Map Reduce have built up world-wide adoption and a community of support leads me to believe that there is something more than smoke there, and something that certainly could benefit the scientific community in a more direct and immediate way.

To wrap it up, though -- can we agree that the Map Reduce paradigm and Hadoop are really different beasts than DBMS'es? They are not equivalent, people!

Apache Tika 0.4 released - get it while it's hot

August 16th, 2009

I had the privilege of being the release manager for yet another Apache Tika release - 0.4 is out the door and has a number of major improvements over prior releases, including a major refactoring that separates Tika's core components from its parsers and from its external interfaces.

You can read the release announcement here.

You can grab Tika here.

WWW2009 in Madrid, Spain

April 24th, 2009

Hanging out in Spain for a week was a blast. Visited the Royal Palace, several museums (including the Prado), the Basilica church, and took a day trip to Toledo.

At this point, just looking forward to coming home!

Apache Tika 0.3 out the door

March 20th, 2009

Apache Tika, a sub-project of Apache Lucene, and a toolkit for content analysis and detection, has just made its 0.3 release.

You can grab the release from a nearby mirror here.