Data Eng Weekly


Hadoop Weekly Issue #61

16 March 2014

This issue of Hadoop Weekly is overflowing with top-notch technical articles. There’s coverage of several parts of the ecosystem, from ZooKeeper to Oozie to YARN. In addition, Kafka, ZooKeeper, and Tez saw releases this week, and the new features of the Kafka and Tez releases were detailed in depth.

Technical

Episode 19 of the All Things Hadoop podcast has an interview with Adam Fuchs, Apache Accumulo PMC member and committer. The podcast covers the Accumulo data model, implementation, client-server architecture and more.

http://allthingshadoop.com/2014/03/13/big-data-with-apache-accumulo-preserving-security-with-open-source/

A post on the Pinterest engineering blog explains the evolution of their ZooKeeper deployment. It talks about how they use ZooKeeper for service discovery, some of the failure scenarios that can occur with ZooKeeper, some early attempts they made to mitigate these failures, and the ultimate solution that Pinterest built. The solution uses a separate ZooKeeper daemon per server that writes configuration files to the local file system for services to consume. It’s similar to Airbnb’s SmartStack, if you’re familiar with that.

http://engineering.pinterest.com/post/77933733851/zookeeper-resilience-at-pinterest
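
To make the consumer side of that design concrete, here is a minimal sketch of a service reading its endpoints from a local file instead of talking to ZooKeeper directly. The file format (one `host:port` per line) and the function name `read_serverset` are assumptions made for illustration; they are not taken from the Pinterest post.

```python
import os
import tempfile


def read_serverset(path):
    """Return a list of (host, port) tuples read from a local file.

    Assumes one "host:port" entry per line (a hypothetical format).
    A missing or unreadable file yields an empty list.
    """
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
    except (IOError, OSError):
        return []
    servers = []
    for line in lines:
        host, _, port = line.rpartition(":")
        servers.append((host, int(port)))
    return servers


# Simulate the file that a per-server daemon would keep up to date.
with tempfile.NamedTemporaryFile("w", suffix=".serverset", delete=False) as f:
    f.write("10.0.1.15:8080\n10.0.1.16:8080\n")
print(read_serverset(f.name))
os.remove(f.name)
```

The point of the pattern is that the service depends only on the local file system at request time, so a ZooKeeper outage leaves the last written file in place.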

The Cloudera blog has a post on Oozie High Availability, which is implemented as an active-active system. For synchronization in the HA system, Oozie uses ZooKeeper for distributed locks. It also requires an HA database and a load-balancing strategy for accessing the cluster. The post describes in more detail some of the subtler parts of the system, such as retrieving log files and security.

http://blog.cloudera.com/blog/2014/03/inside-apache-oozie-ha/

Another post on the Cloudera blog has an interesting analysis of using solid-state drives (SSDs) for MapReduce. SSDs provided higher sequential and much higher random throughput than hard-disk drives (HDDs). The performance comes at a much higher cost per TB, though, and MapReduce’s sequential I/O achieves maximum throughput from HDDs. The post concludes that the cost of SSDs outweighs the performance gains. This is one of the first analyses of its kind that I’ve seen, and I hope we see more in the future (especially with other applications like HBase and Spark as well as using a larger number of smaller SSDs).

http://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/

Performing a major version upgrade of your Hadoop distribution is a harrowing task. Luminar recounts their experience upgrading from HDP 1 to HDP 2, and the summary includes information that’s relevant regardless of your distribution. For instance, Luminar worked with Hortonworks to build a full script for the upgrade and performed a walkthrough (both practices that I’ve found useful in the past). The post also talks about some things that went wrong.

http://hortonworks.com/blog/luminar-thoughts-migrating-hadoop-smoothly-hdp1-hdp2/

Hortonworks has reposted part of an analysis by a Hadoop contributor on the trajectory and makeup of the Hadoop source code repository. While the company-centric analyses are always controversial, there are some really interesting take-aways around the new lines of code and changed lines of code (2013 saw significantly more new lines of code vs 2011-12, but fewer changes to existing lines of code). The post also contains some commentary on the role of the Apache Software Foundation in Hadoop development.

http://hortonworks.com/blog/innovations-contributions-apache-hadoop/

It’s looking more and more like 2014 is going to be the year of Apache Spark and other MapReduce successor frameworks. A post on Dice contains a good overview of Spark, including a concise introduction to Spark Streaming. The author gives one of the first reviews I’ve seen from someone using Spark in practice (the author notes that GraphX, which is still in beta, is a bit buggy), albeit on a 6-node cluster.

http://news.dice.com/2014/03/12/apache-spark-next-big-thing-big-data/

A Hadoop rack awareness script informs the NameNode to which rack a particular node belongs. The information is used to allocate data blocks across racks to survive a rack failure. While several example scripts can be found online for Linux, building a script for Windows is less common. This post walks through building a script using Windows PowerShell, which is a scripting language built on .NET.

http://blogs.msdn.com/b/carlnol/archive/2014/03/14/implementing-hadoop-rack-awareness-with-powershell.aspx
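
For readers who are not on Windows, the same contract can be sketched in Python. Hadoop invokes the configured topology script (the `net.topology.script.file.name` property in Hadoop 2) with one or more IP addresses or hostnames as arguments and reads one rack path per argument from standard output. The subnet-to-rack table and the rack names below are hypothetical; this is an illustration of the contract, not a translation of the PowerShell script in the post.

```python
#!/usr/bin/env python
"""Minimal rack awareness script sketch (hypothetical subnet-to-rack table)."""
import sys

# Hypothetical mapping from the first three octets of an IP to a rack path.
RACKS_BY_SUBNET = {
    "10.0.1": "/dc1/rack1",
    "10.0.2": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(address):
    """Return the rack path for an address, or the default rack if unknown."""
    subnet = address.rsplit(".", 1)[0]
    return RACKS_BY_SUBNET.get(subnet, DEFAULT_RACK)


def main(argv):
    # Print one rack path per argument, in the order the arguments were given.
    for address in argv:
        print(resolve_rack(address))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Unknown addresses fall back to a default rack so that a node missing from the table still gets a valid answer.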

The Python Natural Language Toolkit, or NLTK, is a batteries-included natural language processing framework. NLP problems tend to be easy to adapt to MapReduce given that a text corpus can often be split into documents, paragraphs, sentences, etc. This post covers using NLTK and the mrjob Python/Hadoop library to find the most common proper nouns in a dataset (in this case Moby Dick).

http://empirewindrush.com/tech/2014/03/13/pythons-elephants-whales/

The Sequoia blog has a post about doing Lucene indexing with Hadoop MapReduce. The post, which includes several snippets of example code, describes the problem and elaborates on a custom OutputFormat that the implementation uses to write Lucene indexes to a temporary location on the local file system before copying them to HDFS on task completion.

http://blogs.sequoiainc.com/blogs/hadoop-lucene-indexing-with-mapreduce

Oracle R Advanced Analytics for Hadoop is paid software for running distributed computations with MapReduce from R. A post from the Rittman Mead blog has an overview of this product as well as detailed instructions on setting it up on a CDH4.5 cluster running on RHEL. Oracle provides an evaluation version for developers to test it out.

http://www.rittmanmead.com/2014/03/running-r-on-hadoop-using-oracle-r-advanced-analytics-for-hadoop/

The SequenceIQ blog has an overview of configuring the YARN capacity scheduler for several queues, and examples showing how to submit jobs to a particular queue. They also have some code snippets showing how to parse data from the YARN scheduler API to inspect the queues and jobs at runtime.

http://blog.sequenceiq.com/blog/2014/03/14/yarn-capacity-scheduler/
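
As a rough illustration of inspecting queues at runtime, the sketch below walks a scheduler response and lists each queue with its configured and used capacity. It assumes the API in question is the ResourceManager REST endpoint `/ws/v1/cluster/scheduler`. The sample payload is hand-written and trimmed, the queue names are hypothetical, and the field names should be verified against the REST documentation for your Hadoop version. This is not the code from the SequenceIQ post.

```python
import json

# Hand-written, trimmed sample shaped like a capacity scheduler response.
SAMPLE = """
{"scheduler": {"schedulerInfo": {
  "type": "capacityScheduler", "queueName": "root",
  "capacity": 100.0, "usedCapacity": 40.0,
  "queues": {"queue": [
    {"queueName": "default", "capacity": 70.0, "usedCapacity": 50.0},
    {"queueName": "adhoc", "capacity": 30.0, "usedCapacity": 16.7,
     "queues": {"queue": [
       {"queueName": "small", "capacity": 100.0, "usedCapacity": 16.7}
     ]}}
  ]}
}}}
"""


def walk_queues(info, path=""):
    """Yield (dotted path, capacity, usedCapacity) for a queue and its children."""
    name = path + "." + info["queueName"] if path else info["queueName"]
    yield name, info.get("capacity"), info.get("usedCapacity")
    # Leaf queues have no "queues" entry, so default to an empty child list.
    for child in (info.get("queues") or {}).get("queue", []):
        yield from walk_queues(child, name)


def list_queues(payload):
    """Parse a scheduler JSON payload and return all queues as a flat list."""
    root = json.loads(payload)["scheduler"]["schedulerInfo"]
    return list(walk_queues(root))


for name, capacity, used in list_queues(SAMPLE):
    print("%-20s capacity=%s%% used=%s%%" % (name, capacity, used))
```

Against a live cluster, the payload would come from an HTTP GET on the ResourceManager web address (port 8088 by default) rather than from the embedded sample.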

Pivotal Chief Scientist Milind Bhandarkar recently gave a talk entitled “Extending Hadoop for Fun & Profit.” The talk covers the basics of MapReduce, a real-world example of extending Hadoop’s FileInputFormat to support MPEG video data, Hadoop scalability, the Hadoop shuffle, YARN, Hamster (MPI on YARN), and more. The slides contain a good mix of low-level technical details and big-picture architecture discussion.

http://www.slideshare.net/hadoop/extending-hadoop

News

HBaseCon host Cloudera has announced the keynotes and breakout sessions for the conference, which takes place in May in San Francisco. Keynotes include speakers from Google, Facebook, and Salesforce.com.

http://blog.cloudera.com/blog/2014/03/hbasecon-2014-speakers-keynotes-and-sessions-announced/

Gartner recently released their annual “Magic Quadrant for Data Warehouses” report, and Datanami has a recap of it. For the first time, Gartner has included the offerings of several Hadoop and NoSQL vendors, including Cloudera, MarkLogic, and Amazon Web Services (for Redshift and Elastic MapReduce). Datanami has more details on the report, including some of Gartner’s predictions like “few of the upstart data warehouse vendors will survive past 2016.”

http://www.datanami.com/datanami/2014-03-13/hadoop_and_nosql_now_data_warehouse-worthy:_gartner.html

Qubole has compiled a list of Hadoop influencers to follow on Twitter. It’s a great list if you’re getting started with Twitter or Hadoop and need a list of folks active in the community to follow for the latest news.

http://www.qubole.com/hadoop-influencers/

Releases

Version 0.12.0 of the Kite SDK was released. Kite is a library for building Hadoop systems, and the new release includes new MapReduce support and new features in the morphlines library (which is a framework for facilitating ETL).

http://community.cloudera.com/t5/Release-Announcements/Announcing-Kite-SDK-0-12-0/m-p/7296#M26

Apache Tez 0.3 was released. Tez is a framework for doing distributed computation on a data flow graph, a generalization of the MapReduce framework. The new release includes support for secure Hadoop and improved scalability, fault tolerance, and stability. A post on the Hortonworks blog highlights some of the testing they’ve done at scale and the upcoming integration of Tez with Hive, Pig, and Cascading.

http://mail-archives.apache.org/mod_mbox/tez-dev/201403.mbox/%3C6E26CB74-2607-4E48-9D88-ADC1635C952F@apache.org%3E
http://hortonworks.com/blog/apache-tez-0-3-released/

Apache Kafka 0.8.1 was released. Kafka is a distributed messaging system that’s often used for data ingestion as part of a Hadoop deployment. Despite the patch-level version increment, the new release includes several new features to make Kafka easier to operate and a new log compaction feature. A write-up by Kafka committer and PMC member Jay Kreps has more details on the release.

http://markmail.org/message/dlqshiklh4k37xui
http://blog.empathybox.com/post/79427855885/whats-new-in-kafka-0-8-1
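
Log compaction retains the most recent message for each key instead of discarding data purely by age. The sketch below shows that idea in plain Python. It is a conceptual model only and does not use the Kafka API or mirror Kafka's implementation; the record layout and the immediate removal of delete markers are simplifications assumed for the example.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    `log` is a list of (offset, key, value) tuples. A value of None is
    treated as a delete marker and the key is dropped entirely (a
    simplification made for this sketch).
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    survivors = [rec for rec in latest.values() if rec[2] is not None]
    return sorted(survivors, key=lambda rec: rec[0])


log = [
    (0, "user1", "a@example.com"),
    (1, "user2", "b@example.com"),
    (2, "user1", "c@example.com"),  # supersedes offset 0
    (3, "user3", "d@example.com"),
    (4, "user3", None),             # delete marker for user3
]
print(compact(log))
# → [(1, 'user2', 'b@example.com'), (2, 'user1', 'c@example.com')]
```

A consumer replaying the compacted log still ends up with the final state for every key, which is what makes the feature useful for changelog-style topics.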

Ferry is a new system that lets you run distributed systems on a single Linux machine using Docker. It should be quite useful for building prototypes by running isolated instances in Linux containers. Ferry currently includes support for Cassandra, Hadoop, and Gluster/OpenMPI.

http://ferry.opencore.io/en/latest/index.html

Apache ZooKeeper 3.4.6 was released. The new release includes a large number of bug fixes and improvements.

http://zookeeper.apache.org/doc/r3.4.6/releasenotes.html

The open-source Mortar Framework, which is the self-proclaimed “Rails for Pig”, has a new release that substantially eases starting a Pig REPL. Because the framework streamlines the Pig install behind the scenes, starting a Pig REPL takes only a single command once the Mortar repo has been cloned.

http://blog.mortardata.com/post/79265428248/pig-latin-made-really-easy

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Anomaly Detection presented by Ted Dunning (San Francisco) - Monday, March 17
http://www.meetup.com/sfmachinelearning/events/167352882/

Intro to Hadoop: Hype or Reality? you decide with Kevin Crocker (Palo Alto) - Wednesday, March 19
http://www.meetup.com/Pivotal-Open-Source-Hub/events/165112072/

Practical Machine Learning: How To Decide What Really Matters (Palo Alto) - Wednesday, March 19
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/170814802/

Bay Area Hadoop User Group HUG Monthly Meetup (Sunnyvale) - Wednesday, March 19
http://www.meetup.com/hadoop/events/125191592/

Csaba Toth Presents Hadoop (Fresno) - Wednesday, March 19
http://www.meetup.com/Central-CA-NET-Users/events/167331932/

Apache HBase 0.98 by Andrew Purtell of Intel Apache (Los Angeles) - Thursday, March 20
http://www.meetup.com/Los-Angeles-HBase-User-group/events/169526272/

Colorado

YARN and the new Hadoop core (Boulder) - Wednesday, March 19
http://www.meetup.com/Boulder-Denver-Big-Data/events/166313992/

Texas

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, March 19
http://www.meetup.com/Austin-ACM-SIGKDD/events/167555922/

Missouri

St. Louis Hadoop Users Group Meetup (St. Louis) - Tuesday, March 18
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/166702892/

Pennsylvania

Introduction to Hbase (Philadelphia) - Tuesday, March 18
http://www.meetup.com/PhillyDB/events/169767822/

New York

NYC Next Generation Hadoop Architecture talk and hands on Pivotal Hadoop (New York) - Thursday, March 20
http://www.meetup.com/Pivotal-Open-Source-Hub/events/169381212/

UNIGROUP 20 March 2014 Meeting: Apache Hadoop (New York) - Thursday, March 20
http://www.meetup.com/Unigroup/events/170417452/

CANADA

Monthly Solution Architect Scrum (Toronto) - Thursday, March 20
http://www.meetup.com/TorontoHUG/events/169706762/

HUNGARY

Big Data Cassandra (Budapest) - Monday, March 17
http://www.meetup.com/Big-Data-Meetup-Budapest/events/164646842/

AUSTRALIA

Hadoop MongoDB See John Ballment of Bizcubed build dashboards (Sydney) - Thursday, March 20
http://www.meetup.com/Open-Analytics-in-Sydney/events/168500562/

NORWAY

Stream processing with Storm (Trondheim) - Wednesday, March 19
http://www.meetup.com/Trondheim-Big-Data/events/167050622/

GERMANY

Hadoop Ecosystem Use Cases (Munich) - Wednesday, March 19
http://www.meetup.com/Hadoop-User-Group-Munich/events/164412252/

ISRAEL

Never Ending Data Streams Big Data with Storm Kafka Angular and D3.js (Tel Aviv) - Thursday, March 20
http://www.meetup.com/full-stack-developer-il/events/166864612/

INDIA

Hadoop March MeetUp 2014 (Bangalore) - Friday, March 21
http://www.meetup.com/Bangalore-Hadoop-Meetups/events/169769862/

Practical MapReduce Programming MapRed-a-thon MapReduce Patterns (Pune) - Sunday, March 23
http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/events/170688202/

ENGLAND

March 2014 Meetup Featuring Bloomberg and Elasticsearch (London) - Friday, March 21
http://www.meetup.com/es-london/events/168380502/