Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of MapReduce.
Wukong is a great library for writing map/reduce jobs for Hadoop from Ruby.
Together they can be really great, because problems unsolvable in Pig without resorting to writing a custom function in Java can be solved by streaming data through an external script, which Wukong nicely wraps. The Data Chef blog has a great example of using Pig to choreograph the data flow, and Ruby/Wukong to compute the Jaccard similarity of sets.
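To give a flavor of the approach (this is a minimal sketch, not the Data Chef code itself — the field layout and script name here are hypothetical), a line-oriented Ruby script of the kind Pig's STREAM operator can pipe records through might look like:

```ruby
#!/usr/bin/env ruby
# Hypothetical streaming script: Pig's STREAM operator sends each record as a
# tab-separated line on stdin and reads tab-separated output lines back.

# Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

if __FILE__ == $PROGRAM_NAME
  # Assumed input layout: id1, id2, then two comma-delimited item sets.
  STDIN.each_line do |line|
    id1, id2, set1, set2 = line.chomp.split("\t")
    score = jaccard(set1.split(","), set2.split(","))
    puts [id1, id2, score].join("\t")
  end
end
```

Because the script just reads stdin and writes stdout, it works identically under Hadoop streaming, inside a Pig STREAM clause, or piped to locally for testing.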
Working with Wukong on Elastic MapReduce
Elastic MapReduce is a great resource – it’s very easy to quickly have a small Hadoop cluster at your disposal to process some data. Getting Wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.
Fortunately, Elastic MapReduce allows the use of bootstrap scripts located on S3, which run on boot on all the machines in the cluster. I used the following script (based on an example on Stack Overflow):
sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri
Using Amazon’s command-line utility, starting the cluster ready for use in Pig interactive mode looks like this:
elastic-mapreduce --create --bootstrap-action [S3 path to wukong-bootstrap.sh] --num-instances [a number] --slave-instance-type [machine type] --pig-interactive --ssh
The web tool for creating clusters has a space for specifying the path to a bootstrap script.
Next step: upload your Pig script and its accompanying Wukong script to the name node, and launch the job. (It’s also possible to do all of that when starting the cluster with more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate when your job does.)
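As a sketch of that upload-and-launch step (the key file, script names, and master-node hostname below are all placeholders — substitute your own):

```shell
# Copy the Pig script and the Wukong script it streams through
# to the master (name) node of the cluster.
scp -i my-keypair.pem jaccard.pig jaccard_wukong.rb \
    hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:

# Log in to the master node and launch the job.
ssh -i my-keypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
pig jaccard.pig
```

Note that the Wukong script must be shipped alongside the Pig script (or referenced with a SHIP clause in the STREAM definition) so that Pig can find it at runtime.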