Using Apache Tika to come up with a 2013 Apache projects Tag Cloud

July 3rd, 2013

Rich Bowen, former ASF Director, recently asked on the Community Development lists whether there was an updated Apache word cloud we could use to refresh the Planet Apache page. Its word cloud of Apache projects dated from 1999 and was badly out of date.

Maybe unsurprisingly, this sounded like a job for Apache Tika. I managed to whip up a quick script that uses Tika to scrape the names of the Apache projects that have defined ASC signing keys, listed here.

What follows is, verbatim, the script I came up with:


shopt -s expand_aliases
export PATH=/usr/local/tika:${PATH}
alias tika="java -jar /usr/local/tika/tika-app-1.3.jar"

tika -t "" \
  | grep -v "pmc" | grep -v "DAV" | grep -v "Index of" \
  | grep -v "Name" | grep -v "Parent" \
  | awk '{print $1}' \
  | cut -d. -f1 | sort | uniq

I saved this in a script

and made it executable. Running it produces a clean, de-duplicated list of Apache project names that you can then dump into Wordle, Tagxedo, or other word cloud generators. Here are a couple I generated. Simple with Tika, see?
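To see what the filter stage of that script does without fetching anything, here is a minimal sketch that runs the same grep/awk/cut chain over a made-up listing. The sample lines and the function names sample_listing and extract_projects are assumptions for illustration only; the real text Tika extracts from the keys page will look different.

```shell
#!/usr/bin/env bash
# Hypothetical sample of the plain text Tika might extract from a
# directory index page. These lines are invented for illustration.
sample_listing() {
  cat <<'EOF'
Index of /dist
Name Last modified Size
Parent Directory -
oodt.asc 2013-06-01 12:00 4.1K
tika.asc 2013-05-20 09:30 7.9K
tika.asc 2013-05-20 09:30 7.9K
pmc.asc 2013-01-01 00:00 1.0K
EOF
}

# Same filters as the script above: drop the index chrome, keep the
# first column, strip everything after the first dot, de-duplicate.
extract_projects() {
  grep -v "pmc" | grep -v "DAV" | grep -v "Index of" \
    | grep -v "Name" | grep -v "Parent" \
    | awk '{print $1}' | cut -d. -f1 | sort | uniq
}

sample_listing | extract_projects
```

On this sample input the pipeline prints oodt and tika, one per line. Note that grep -v "Name" and friends drop any line containing those strings anywhere, so a project whose listing line happened to contain one of them would be filtered out too.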

Open Source Summit 3.0: Communities

June 14th, 2013

On June 25 and 26, 2013, I'll be participating in the Open Source Summit 3.0 Communities Meeting. This marks the third time, and third straight year, that I've been part of the organizing team, and this will be the first year I'm not directly speaking at the event. You can find the other videos of me speaking linked at the bottom of this post.

Besides the 11 federal agencies that will be repped at the event (NASA, NIH/NCI, NLM, DARPA, Census, State Dept, DOE, etc.), the great part about this year's meeting is its focus on Open Source communities. These communities are still not well understood, especially in government, so I'm excited to teach others how they work: not only how government can interact with communities that are already open source or that live outside government, but also how agencies can form their own open source communities and grow them out in the wild. This is a key issue for the sustainability of software, since NSF, NASA, DOD, and NIH are all concerned with how software lives on beyond the grant, especially in this day and age of budget and fiscal crises.

In addition to being excited about attending and the conversations that will occur during the event, I just wanted to throw a shout out to the Planning Team culled from 11 federal agencies. I had a great time working with y'all and am looking forward to meeting some of you in person and to getting our gov OSS on!

Come join us June 25 and 26 at NYU DC. Event registration can be found here.

Related Links

Apache Board of Directors

June 1st, 2013

I was privileged enough to be elected to the Apache Board of Directors on May 23, 2013.

All I can say is wow. I've been contributing to the ASF since 2005, and this feels less like the end of a long journey than a new beginning: the ability to help guide the foundation. I believe I was elected in large part b/c of the emerging class of people and projects at the ASF who are interested in scientific computing, which is really what I can hang my hat on as something that I regularly encourage in terms of bringing new projects to the ASF community.

Apparently I'm also a part of one of the fastest-growing communities at Apache (OODT) and I'm also the most connected person in the ASF. Thanks to Rob Weir for the analysis and sweet viz.

I'm not sure what the future holds in terms of whether this will ever happen again, but it happened now, and I'm very thankful.

The current 2013-2014 ASF Board of Directors is listed below:

  • Shane Curcuru
  • Doug Cutting (chairman)
  • Bertrand Delacretaz
  • Roy T. Fielding
  • Jim Jagielski
  • Chris Mattmann
  • Brett Porter
  • Sam Ruby
  • Greg Stein

Adding TripIt to your site - not hard!

October 3rd, 2012

So I scoured Google, searching for how to add TripIt JavaScript to my site in order to display my upcoming and current trips. I'm just sick of maintaining my data over and over again and I've pretty much sold out to TripIt in order to record that type of info and share it out. I used to maintain my own list of upcoming and recent events (conferences, and stuff) on my site. Well to hell with that now.

So, scouring Google wasn't very helpful, but in the end I finally found this article describing how to add a "blog badge" to your site, which turned out to be what I was looking for. Basically you just go here on your account profile, enable the blog badge at the bottom, and grab the JavaScript it gives you. Don't forget to click the Save button on the bottom right before embedding the JavaScript on your site (I forgot ;) ). And boom, you have a blog badge that will list your upcoming trips and recent events. See my USC web site for mine under Recent and Upcoming Events.

Can the real Google URL please stand up?

February 28th, 2012

I am so bleeping sick of Google's trackback and ad URL monetization nowadays. Gah.

When I get a list of results from Google now, instead of clickable links straight to the URLs that the search engine points at, I get these:

Reading around on the web, I found this link that has on it a really small, awesome Perl hack that uses the CGI module to pull the real URL out of the Google link's url parameter.

So I created a simple script called real_goog_url based on that hack:

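% cat $HOME/bin/real_goog_url
#!/usr/bin/env perl

use strict;
use warnings;
use CGI;

# Parse the Google link as a query string and print its 'url' parameter.
my $p = CGI->new($ARGV[0]);
my $url = $p->param('url') // '';
print "$url\n";

I can use this script like so to get the real Goog URL now!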

% $HOME/bin/real_goog_url ""
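
If you don't have the CGI module handy, roughly the same extraction can be sketched in plain bash. This is not the Perl hack above, just a stand-in under stated assumptions: the function name real_url is made up, the example link is hypothetical, and the only thing carried over from the Perl script is that the real address lives in the url query parameter.

```shell
#!/usr/bin/env bash
# Print the percent-decoded value of the 'url' query parameter.
# real_url is a hypothetical helper name, not part of any tool.
real_url() {
  local query="${1#*\?}"          # drop everything up to the first '?'
  local pairs pair value
  IFS='&' read -ra pairs <<< "$query"
  for pair in "${pairs[@]}"; do
    if [[ $pair == url=* ]]; then
      value="${pair#url=}"
      value="${value//+/ }"       # '+' encodes a space in query strings
      # Turn each %HH into \xHH and let printf %b expand the escapes.
      printf '%b\n' "${value//%/\\x}"
      return 0
    fi
  done
  return 1                        # no 'url' parameter found
}

# Hypothetical redirect-style link, made up for this example.
real_url 'http://www.google.com/url?sa=t&url=http%3A%2F%2Ftika.apache.org%2F&ei=abc'
```

For the made-up link above this prints http://tika.apache.org/. It relies on bash features (here-strings, arrays, and printf %b with \x escapes), so it won't run under a strictly POSIX sh.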


Fortune and Glory

January 28th, 2012

Welcome to 2012 people!

Apparently I forgot how to write blog posts at some point last year, so this is going to be a news roundup of notable events since my last post.

  • Tika in Action is published in paperback and now available on
  • I wrote 13+ proposals to NASA, NSF and the NIH, most of which to date have not been funded.
  • Christian is 2.5 and growing up fast. He's already out of the crib and into his own bed.
  • Lisa, Christian and I took a Christmas road trip to Roswell, NM (15 hours from Rancho Cucamonga, CA on day 1), and then to San Antonio, TX, and then from there to Victoria, TX. Then we decided to drive 22 hours in a single day to get home. Yes, that's bucket list shit.
  • I discovered Yelp!.
  • I discovered MyFitnessPal.
  • I got Siri and an iPhone 4s.
  • I started teaching again.
  • I made some awesome assignments for CSCI 572: Information Retrieval and Search Engines at USC.

Other than that, just trying to keep my head above water. I'll try and post more often but can't promise anything.

I'll leave you guys with my AGU Ignite talk on NASA ♥ Apache open source software.

Tika in Action: Principal Chapters and Content Delivered!

April 30th, 2011

We just got back the final reviews from Tika in Action and I'm compiling the final actions and responses to those comments. A long journey is nearing its end!

Tika in Action update: Ch9, 10 on MEAP, 12, and 15 done

March 23rd, 2011

Hey all, just a quick Tika in Action MEAP update. Chapters 9 and 10 are now available. Chapters 12 and 15 have been submitted to the publisher. 14 should be done this week, and hopefully 11 will follow shortly thereafter. At that point, the book will be fully drafted! I'm so excited!

NASA ♥ Apache - the sequel

January 4th, 2011

So, NASA just put out its press release regarding Apache OODT. The ASF one is soon to follow and now Information Week has just posted an article about OODT as well.

Thanks to the OODT PMC and all those contributors who are making the project a big success!

Tika in Action: Chapter 8 out the door

December 15th, 2010

Welp, like I just posted on my Facebook, I've finally pushed Chapter 8 of Tika in Action out the door.

Should be coming to a MEAP near you sometime soon!