Data Eng Weekly


Hadoop Weekly Issue #25

07 July 2013

There's still a bunch of news trickling out of Hadoop Summit, so despite the short week in the US due to the 4th of July, there's a lot of content for this week's issue. There are also several announcements unrelated to the summit -- Apache HBase and Apache Flume releases, a new "Heroku for Hadoop" system from Mortar Data, a couple of great tutorials for Apache Whirr and Apache Avro, and Yahoo! breaking the GraySort record with Hadoop. This issue is also an exciting milestone -- it's the 25th (one shy of half a year!). Thanks to everyone who's given me positive feedback (and passed along articles!) over the past several months.

Technical

Cloudera and NGDATA, the makers of the Lily Customer Intelligence Platform, collaborated on Cloudera Search for HBase. NGDATA has been coupling HBase and Solr together for years, and they have open-sourced a number of projects related to Solr indexing of HBase. This article talks about the history of HBase+Solr at NGDATA, gives some background on the Cloudera/NGDATA collaboration, and references some open-source projects from the folks at NGDATA.

Mortar Data has announced a new feature for their Hadoop-as-a-service offering that provides "Heroku-like deploy for Hadoop." In other words, deploying your changes is as easy as pushing your code with git (and rolling back is just as easy), just like Heroku. Coupled with their scheduling system, each scheduled run will use your most recent push.

http://blog.mortardata.com/post/54443883261/git-based-deployment-scheduling

Parcel is a new file format for distributing binary artifacts with Cloudera Manager. Unlike RPMs, multiple versions of a parcel can be installed simultaneously (although only one is active). This post gives some background on parcels, dives into the parcel file format, and gives an example of building a custom parcel (for LZO compression).

http://blog.cloudera.com/blog/2013/07/one-engineers-experience-with-parcel/

Apache Avro has gained wide adoption as a storage format in HDFS due to its self-describing data format, which makes it easy to integrate with other systems since you don't have to deal with codegen (like with Thrift or protobufs). This tutorial covers Avro's integration with several of the main components in the Hadoop ecosystem -- MapReduce, Hadoop Streaming, Apache Hive, and Apache Pig. It has everything you need to get going with each of the frameworks, which could save you hours or days if you're just getting started.

http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/

At Hadoop Summit, Alan F. Gates of Hortonworks presented on the Stinger initiative. His presentation includes an overview of some of the recent optimizations (in-memory joins, collapsing of group by/order by), experimental results demonstrating massive speedups with the optimizations, new features (decimal data type, OVER clause, and more), and a discussion of some future work (reducing startup time, improved optimizer, better caching).

http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013

Apache Whirr is a coordination system for running distributed services in the cloud. There are a number of tutorials that detail how to use Whirr with Amazon EC2, but this is the first that I've seen that details how to use Whirr with an Apache CloudStack deployment.

http://buildacloud.org/blog/271-big-data-on-demand-with-apache-whirr.html

The Hadoop team at Yahoo! released some results for the Gray Sort Benchmark using over 2000 nodes and Apache Hadoop 0.23.7. With this setup, they were able to sort a whopping 1.4TB/minute -- up from the previous record of 0.725TB/minute -- to capture the new Gray Sort record. The post has some more details about the benchmark, including some of the rules and detailed performance numbers.

http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
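For a rough sense of scale, the two throughput figures quoted above can be compared with a little arithmetic. This is only a back-of-the-envelope sketch using the numbers from the summary; the function and constant names are illustrative and nothing here is taken from Yahoo!'s post beyond those two figures.

```python
# Throughput figures quoted in the summary above (TB sorted per minute).
NEW_RECORD_TB_PER_MIN = 1.4
PREVIOUS_RECORD_TB_PER_MIN = 0.725


def speedup(new_rate, old_rate):
    """Return how many times faster new_rate is than old_rate."""
    return new_rate / old_rate


def percent_increase(new_rate, old_rate):
    """Return the relative improvement of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100


if __name__ == "__main__":
    factor = speedup(NEW_RECORD_TB_PER_MIN, PREVIOUS_RECORD_TB_PER_MIN)
    gain = percent_increase(NEW_RECORD_TB_PER_MIN, PREVIOUS_RECORD_TB_PER_MIN)
    print(f"Speedup over previous record: {factor:.2f}x")
    print(f"Relative improvement: {gain:.0f}%")
```

In other words, the new result is a bit under twice the previous record.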

News

InformationWeek has an interview with Raymie Stata, CEO of Altiscale. Altiscale has a different business model than other Hadoop vendors -- they will offer a Hadoop service that's aimed at established Hadoop users (other Hadoop-as-a-Service vendors tend to target users that are new to Hadoop). Specifically, they are targeting users with 10- or 20-node clusters and hope to help those users grow to hundreds of nodes with their hosted service.

http://www.informationweek.com/cloud-computing/software/altiscale-preps-large-scale-hadoop-as-a/240157650

The Syncsort blog has a good recap of Hadoop Summit (with only a short sales pitch at the end). As the post explains, YARN was definitely the largest topic of discussion, but there were also a number of discussions around performance, file formats, security, and more.

http://blog.syncsort.com/2013/07/a-birds-eye-view-of-the-elephant-hadoop-summit-2013/

Big Data Republic also has a recap of Hadoop Summit, including a list of vendor announcements and some anecdotes from the sessions. The author also notes that the conference was very well done, and that "the excitement reminds me dearly of the early JavaOne conferences…"

http://www.bigdatarepublic.com/author.asp?doc_id=265222&section_id=2809

'BinaryPig' is a new framework used by security researchers to analyze malware binaries. The framework will be unveiled at Black Hat Security Briefings later this month, and the code will be open-sourced at the same time. The article quotes one of the authors of BinaryPig as saying, "Big Data technology is going to revolutionize the security industry." It's interesting to read about Hadoop being used in new and innovative ways.

http://www.darkreading.com/threat-intelligence/binarypig-uses-hadoop-to-hunt-for-patter/240157505

Releases

Cloudera released a new beta of their Search product, Search 0.9.1. This release adds the ability to do real-time indexing of data during ingestion to HBase (previously only HDFS was supported).

https://groups.google.com/a/cloudera.org/d/msg/cdh-user/LDuwQ3S8Yyg/opfModvrM5YJ

DataStax Java Driver 1.0.1 was released. It contains a number of bug fixes and improvements.

https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/JXqTJY105t4/-XEyhfwZsAkJ

Apache Flume 1.4.0 was released. It includes a number of new features developed over the past 6 months, since the Apache Flume 1.3.1 release. Highlights include: a new JMS source, a Solr sink, support for SSL when using Avro-RPC, and support for an embedded Flume agent.

http://mail-archives.apache.org/mod_mbox/flume-user/201307.mbox/%3CCAJLbxRYQtYODUQQKQTFRe7ScMhdBVkkXYPLk2QwYXGkv0jXtYA%40mail.gmail.com%3E

DataStax released the GA version of their C# Driver for Apache Cassandra. It shares all of the same major features as the Java Driver -- including load balancing, connection pooling, failover, and node discovery.

http://www.datastax.com/dev/blog/datastax-csharp-driver-is-now-final

Apache HBase 0.94.9 was released and is considered the new stable version. It has a number of bug fixes and minor improvements and contains contributions from 18 individuals. Releases from the 0.92 and 0.94 series can do a rolling upgrade to this version without downtime.

http://mail-archives.apache.org/mod_mbox/hbase-user/201307.mbox/%3C1372960689.26972.YahooMailNeo%40web140603.mail.bf1.yahoo.com%3E

Events

Curated by Mortar Data ( http://www.mortardata.com )

Tuesday, July 9
Applied Data Science with Soren Macbeth (New York, NY)
http://www.meetup.com/NYC-Data-Science/events/127822492/

Thursday, July 11
Solr + Hadoop = Big Data Search (Cambridge, MA)
http://www.meetup.com/bostonhadoop/events/126171012/

Thursday, July 11
Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem (San Jose, CA)
http://www.meetup.com/BigDataGurus/events/126907642/

Thursday, July 11 through Saturday, July 13
The Fifth Elephant (Bangalore, Karnataka - India)
https://fifthelephant.in/2013