Data Eng Weekly


Hadoop Weekly Issue #47

08 December 2013

This week’s newsletter is one of the largest compilations of new releases in a while--Tez, Summingbird, Kafka, and the CDK all had releases. The tools and ecosystem around Hadoop seem to be maturing, and the momentum behind YARN is growing (the Cloudera blog has a great article about resource management on YARN). There’s a lot of great information to digest in this issue--enjoy!

Technical

In the second part of a series on the HBase Thrift interface, the Cloudera blog covers how to use the API to read from and write to HBase. Specifically, it covers creating a table, listing all tables, adding rows through Thrift (including in batch), and getting rows. Scanning, and choosing between Thrift and REST, will be covered in the next and final part of the series.

http://blog.cloudera.com/blog/2013/12/how-to-use-the-hbase-thrift-interface-part-2-insertinggetting-rows/

Adam Kawa of Spotify has some fantastic slides about Hadoop in practice. After a brief introduction to HDFS, the slides go through some of its key design decisions and a couple of real-world issues. Next, they go through the details of MapReduce and two practical issues with it that Spotify had to solve.

http://www.slideshare.net/AdamKawa/hadoop-intheoryandpractice#!

A lot of interesting work has been done with YARN resource allocation and management. The Cloudera blog has an overview of the goals and motivation of YARN and its scheduler, details on the ‘dominant resource fairness’ allocation strategy that YARN uses, and an explanation of how YARN can enforce maximum CPU utilization with Linux cgroups.

http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
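For readers new to dominant resource fairness, here is a minimal Python sketch of the idea: repeatedly give the next task to whichever user currently has the smallest dominant share, meaning the largest fraction of any single resource that user holds. This is a simplified illustration and not YARN's scheduler code. The function name, the dictionary layout, and the integer resource units are assumptions made for the example, as is the choice to keep scheduling other users after one user's next task no longer fits.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Sketch of dominant resource fairness via progressive filling.

    capacity: resource name -> total units in the cluster (integers).
    demands:  user -> {resource name -> units needed per task} (integers).
    Returns:  user -> number of tasks granted.
    """
    used = {resource: 0 for resource in capacity}
    tasks = {user: 0 for user in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any one resource currently held by this user.
        return max(
            Fraction(tasks[user] * demands[user][resource], capacity[resource])
            for resource in capacity
        )

    while active:
        # Serve the user with the smallest dominant share; ties go to the
        # alphabetically first user so the result is deterministic.
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # Assumption of this sketch: a user whose next task does not
            # fit is dropped, and the remaining users keep being served.
            active.discard(user)
    return tasks


# Hypothetical cluster: 9 CPUs and 18 GB of memory shared by two users.
cluster = {"cpu": 9, "memory_gb": 18}
per_task = {
    "A": {"cpu": 1, "memory_gb": 4},  # memory-heavy tasks
    "B": {"cpu": 3, "memory_gb": 1},  # CPU-heavy tasks
}
print(drf_allocate(cluster, per_task))  # {'A': 3, 'B': 2}
```

With these numbers, user A ends up with 12 of 18 GB and user B with 6 of 9 CPUs, so both hold a dominant share of two thirds even though their tasks look nothing alike.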

Every so often I hear that Spark’s feature set is a superset of MapReduce, and it will replace MR because in many cases it’s faster. I think that Spark has an important role to play, but it won’t soon completely replace MapReduce. At the recent Spark Summit, Eric Baldeschwieler presented on how Spark complements and integrates with Hadoop and his best guesses on what the future holds.

http://spark-summit.org/wp-content/uploads/2013/10/Baldeschwieler-SparkSummit2013v2.pdf

Sentry, an Apache incubator project, is a system for enforcing fine-grained (table- and view-level) role-based authorization for Hive and Impala. The Cloudera blog has an overview of how to use Sentry with Hive, including configuration, the role of Sentry in the Hadoop security landscape, and a demo video.

http://blog.cloudera.com/blog/2013/12/how-to-get-started-with-sentry-in-hive/

In a post that starts by talking about the SQL-on-Hadoop wars, the MapR blog has one of the best overviews of the goals of Drill (an Apache incubator project). In short, Drill doesn’t require centralized metadata management. Rather, it can use the information embedded in JSON, Avro, or other self-describing data formats to evaluate queries on data. It’s worth noting, though, that Drill’s latest release was called “milestone 1,” and it’s not as mature as other software in this space.

http://www.mapr.com/blog/structured-sql-or-mongo-like-flexibility-with-hadoop-you-can-have-both
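To make the "no centralized metadata" idea concrete, here is a small Python sketch of a query evaluated directly against self-describing JSON records, with field names discovered as each record is read. It only illustrates the concept and says nothing about how Drill is implemented. The query function, its parameters, and the sample records are all hypothetical.

```python
import json


def query(lines, select, where=None):
    """Project and filter JSON records without a predeclared schema.

    lines:  iterable of JSON strings, one record per line.
    select: field names to return; a field missing from a record is None.
    where:  optional predicate taking the decoded record.
    """
    for line in lines:
        record = json.loads(line)  # the record carries its own field names
        if where is not None and not where(record):
            continue
        yield {name: record.get(name) for name in select}


# Hypothetical input: the second record has a field the first one lacks.
LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "SE"}',
    '{"user": "cy", "clicks": 1}',
]

rows = list(
    query(
        LINES,
        select=["user", "country"],
        where=lambda r: r.get("clicks", 0) > 2,
    )
)
print(rows)
```

Nothing had to be registered in a metastore before the query ran, which is the trade-off the post describes: flexibility up front, at the cost of only learning about missing or inconsistent fields at read time.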

The MapR blog features an article describing how to back up data and metadata for Hive using MapR. Most of the post is devoted to setting up MySQL master-slave replication to replicate the Hive metastore, which is applicable to any Hive deployment. In addition, it describes how to use MapR FS’s replication (which can be inter- or intra-cluster) to replicate data changes.

http://www.mapr.com/blog/how-to-use-mapr-volumes-with-hive-and-mysql-for-mirroring

InfoWorld has a good article about how procuring hardware for Hadoop can go against the grain of what your company is used to. In particular, it highlights the local disk vs. SAN debate and shares some relevant experience and advice. The author also theorizes that we might see some Hadoop appliances branded as SANs in order to avoid the Hadoop special case in hardware procurement.

http://www.infoworld.com/d/application-development/never-ever-do-hadoop-232090

News

Apache Ambari, the management software for Hadoop, graduated from the Apache incubator this week. Ambari was originally spearheaded by Hortonworks and is part of the Hortonworks Data Platform (HDP).

http://hortonworks.com/blog/apache-ambari-graduates-to-apache-top-level-project/

Hortonworks has announced that HDP 2.0 supports Ubuntu 12.04, the latest LTS. The new Ubuntu support is in addition to existing support for Windows Server, CentOS, Red Hat, and Oracle Linux. Hortonworks says that they now support the OSes used by 99% of enterprises.

http://hortonworks.com/blog/hortonworks-data-platform-2-0-certified-for-ubuntu-12-04/

Releases

The Apache incubator project, Tez, released version 0.2.0-incubating this week. The release resolves over 400 issues. Among the important changes are support for Apache Hive and major updates to the Tez Engine API.

https://www.mail-archive.com/dev@tez.incubator.apache.org/msg00346.html

Hortonworks has announced a technical preview of Apache (incubating) Falcon. Falcon is a data processing and management system. The technical preview requires HDP 2.0 GA and supports RHEL 6, CentOS 6, and Oracle Linux 6.

http://hortonworks.com/blog/apache-falcon-tech-preview-available-now/

MapR announced that they’re integrating Hue into their distribution. Starting with MapR Distribution 3.0.2, Hue 2.5 is integrated. The 3.0.2 release of MapR also includes some fixes for Oozie.

http://www.mapr.com/blog/continuing-on-the-eco-friendly-journey-hue-2-5-beta-included-in-the-mapr-distribution

Cloudera announced version 2.5.5 of their ODBC Driver for Hive. This version improves support for authentication and encryption (SSL encryption for client traffic, proxy support via delegation IDs).

http://community.cloudera.com/t5/Release-Announcements/Announcing-ODBC-Driver-Version-2-5-5-for-Hive/td-p/3687

Amazon Web Services announced a new tag feature for Elastic MapReduce, which allows tagging of clusters with up to 10 identifiers. The tags can be specified when a cluster is launched and modified while the cluster is running.

http://aws.typepad.com/aws/2013/12/tag-your-elastic-mapreduce-clusters.html

Version 0.3.0 of Summingbird, the hybrid batch/real-time computation framework, was released this week. The release resolves over 30 issues. Major improvements include a pluggable cache API, acks for post-processing, and improved logging.

https://github.com/twitter/summingbird/releases/tag/0.3.0

Hue, the Web UI for Hadoop, released version 3.5.0. The new version includes over 250 commits, a new look and feel, SSO with a new SAML backend, a new facet UI for search, and much more. Hue seems to be moving really fast: version 2.5 was released only 4 months ago.

http://cloudera.github.io/hue/docs-3.5.0/release-notes/release-notes-3.5.0.html

The Cloudera Development Kit hit version 0.9.0. This version adds support for random set/get to HBase, Parquet via Crunch, and CSV. It also contains several updates to the Morphlines library.

http://cloudera.github.io/cdk/docs/0.9.0/release_notes.html

Apache Kafka released version 0.8.0. Kafka is a popular choice for transporting event data from application servers to HDFS. One of the most compelling new features in the 0.8.0 release is intra-cluster replication.

http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3C20131204020401.DFDF110DA0%40minotaur.apache.org%3E

Events

Curated by Mortar Data (http://www.mortardata.com)

Monday, December 9

SHUG8. Hive On Steroid (Olivier Renault) + Analytics @ Spotify (Henrik Landgren) (Stockholm, Sweden)
http://www.meetup.com/stockholm-hug/events/154148952/

Hadoop In-Depth & How Node.js Takes Over JEE on Scalable Big Data Elastic Cloud (Mountain View, CA)
http://www.meetup.com/Frontier-Real-time-Streaming-Big-Data-Virtualization/events/134231892/

Tuesday, December 10

COJUG - Apache Mahout (Columbus, OH)
http://www.meetup.com/techlifecolumbus/events/153561102/

First meeting of the Bay Area Cloudera User Group (San Francisco, CA)
http://www.meetup.com/Bay-Area-Cloudera-User-Group/events/149374172/

December Big Data Meetup (Budapest, Hungary)
http://www.meetup.com/Big-Data-Meetup-Budapest/events/138089032/

Join us for: Beyond Hadoop - Building a Big Data Platform as a Service (Cupertino, CA)
http://www.meetup.com/Tech-Talks-BlueKai/events/148677322/

Big Data Technologies - From Integration to Analysis: A full Big Data Scenario (Portland, OR)
http://www.meetup.com/Hadoop-Portland/events/150386712/

Real Time Twitter Integration into MongoDB and Hive with Related Analytics (Saint Augustine, FL)
http://www.meetup.com/HUGNOFA/events/148229602/

TriHUG Social + Lightning Talks (Durham, NC)
http://www.meetup.com/TriHUG/events/150758242/

NGP VAN and Elasticsearch + Elasticsearch with Hadoop (Washington, D.C.)
http://www.meetup.com/Elasticsearch-Washington-DC/events/152052972/

Real-time Trade Data Monitoring with Storm & Cassandra (New York, NY)
http://www.meetup.com/Big-Data-Warehousing/events/151419032/

Indy Big Data December 2013 Meetup (Carmel, IN)
http://www.meetup.com/IndyBigData/events/152926902/

Wednesday, December 11

Houston Hadoop Meetup Series (Houston, TX)
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/145698902/

Thursday, December 12

WHUG 15. What kind of data science do you need? - Agnieszka Zdebiak (Warsaw, Poland)
http://www.meetup.com/warsaw-hug/events/153793722/

HBase browser in HUE by Abraham Elmahrek of Cloudera (Los Angeles, CA)
http://www.meetup.com/Los-Angeles-HBase-User-group/events/152073322/

Design Patterns for Big Data Architecture (Sydney, Australia)
http://www.meetup.com/Big-Data-Analytics/events/153606372/

BDNSHH December (Hamburg, Germany)
http://www.meetup.com/BDNSHH/events/149337522/

R & Scala for fast in-memory predictions on Hadoop via H2O! (San Francisco, CA)
http://www.meetup.com/San-Francisco-Big-Data-Science/events/154123732/

Saturday, December 14

Hadoop Training Class (Costa Mesa, CA)
http://www.meetup.com/Future-Chief-Data-Scientists-in-Orange-County-CA/events/151027732/

Hadoop google+ hangout (Denver, CO)
http://www.meetup.com/Big-Data-for-Business/events/149759222/