Data Eng Weekly


Hadoop Weekly Issue #121

17 May 2015

It seems like every week there is at least one exciting new release. This week, Apache HBase 1.1.0 is atop the list, but there were also new releases of Apache Sqoop, Apache Curator, and Apache Knox. For technical deep-dives, there are posts on these projects as well as coverage of Apache Flink, Apache HDFS, Apache YARN, and Apache Spark. In news, Hortonworks reported quarterly earnings this week, and there's a recap of the recent Hadoop Bug Bash.

Technical

The Cloudera blog has a post describing a new feature in Apache Hadoop 2.6.0 and CDH 5.4.0: hot swapping of DataNode drives. To perform the swap without restarting the DataNode daemon, the system makes use of another new feature, live reconfiguration via the Reconfigurable framework. The post describes how to make these changes via the command line and with Cloudera Manager.

http://blog.cloudera.com/blog/2015/05/new-in-cdh-5-4-how-swapping-of-hdfs-datanode-drives/

The Apache Flink blog has a post on how Flink manages memory to minimize overhead and GC pressure. Specifically, Flink stores objects in a collection of 32KB MemorySegments, uses custom serializers (which have special support for primitives, arrays, Tuples, case classes, and POJOs), makes use of fixed-length sort keys for efficient sorting, and operates directly on binary data whenever possible. In addition, the post shows how this strategy performs for sorting data in comparison to an on-heap array and Kryo serialization.

http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
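
To make the fixed-length sort key idea concrete, here is a minimal sketch in plain Python. It is not Flink code: the record layout, the 4-byte prefix size, and all function names are assumptions made for illustration. Records are packed into a single byte buffer, and sorting compares short binary prefixes first, reading the full key from the buffer only when two prefixes tie.

```python
import struct
from functools import cmp_to_key

PREFIX_LEN = 4  # hypothetical size of the fixed-length sort key, in bytes


def serialize(records):
    """Pack (name, count) records into one bytearray and return it with each record's offset."""
    segment = bytearray()
    offsets = []
    for name, count in records:
        encoded = name.encode("utf-8")
        offsets.append(len(segment))
        # Layout per record: 2-byte name length, name bytes, 4-byte signed count.
        segment += struct.pack(">H", len(encoded)) + encoded + struct.pack(">i", count)
    return segment, offsets


def name_bytes(segment, offset):
    """Read the raw name bytes of the record stored at `offset`."""
    (length,) = struct.unpack_from(">H", segment, offset)
    return bytes(segment[offset + 2 : offset + 2 + length])


def sort_key_prefix(segment, offset):
    """Fixed-length, byte-comparable prefix of the name, zero-padded to PREFIX_LEN."""
    return name_bytes(segment, offset)[:PREFIX_LEN].ljust(PREFIX_LEN, b"\x00")


def sorted_offsets(segment, offsets):
    """Sort record offsets by name, comparing prefixes first and full keys only on ties."""
    index = [(sort_key_prefix(segment, off), off) for off in offsets]

    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Prefixes tie, so fall back to the full key stored in the buffer.
        full_a, full_b = name_bytes(segment, a[1]), name_bytes(segment, b[1])
        return (full_a > full_b) - (full_a < full_b)

    return [off for _, off in sorted(index, key=cmp_to_key(compare))]


def sorted_names(records):
    """Serialize the records, sort them by name, and decode the names in sorted order."""
    segment, offsets = serialize(records)
    return [name_bytes(segment, off).decode("utf-8") for off in sorted_offsets(segment, offsets)]
```

Calling `sorted_names([("pear", 3), ("apple", 1), ("application", 5)])` sorts by name; "apple" and "application" share the prefix "appl", so that pair is resolved with the full-key fallback.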

This tutorial on the MapR blog shows how to use PySpark and MLlib to target and classify customers in a fictitious streaming audio platform. The post shows how to wrangle data into the proper format and then use MLlib's logistic regression implementations to train and evaluate models.

https://www.mapr.com/blog/classifying-customers-mllib-and-spark

This post describes some of the confusion created by the terms "consistency" and "availability" when it comes to distributed systems. In particular, these terms have very strong meanings in the context of the CAP theorem, and those semantics often don't match what you want in a production system. The post has a clear overview of the terms, and it includes ZooKeeper as a case study. It's a good read for anyone working with distributed systems.

https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html

The LA Big Data Users Group recently hosted a talk on Apache Ignite (incubating), which is a distributed framework for in-memory data management. Among Ignite's many features is a drop-in Hadoop accelerator which will run existing MapReduce jobs in-memory. Both the slides and the video are up on SlideShare.

http://www.slideshare.net/sawjd/introduction-to-apache-ignite-tm-incubating-by-nikita-ivanov-of-gridgain

A post on the Scalding blog has an update on running Scalding/Cascading atop Apache Tez. The post describes a benchmark job that is 20 Cascading Flows (420 steps in Hadoop, 20 DAGs in Tez) across 10k lines of Scala. In the two test datasets, speedups are 2.25x and ~18x. Since this test was run about a month ago, the developers have found and fixed a number of bugs, and the post stops just short of saying the integration is production-ready.

http://scalding.io/2015/05/scalding-cascading-tez-%E2%99%A5/

Sqoop2 has recently gained support for using PostgreSQL as a repository (in addition to an embedded Derby DB). A post on the ingest.tips blog has more details on the Sqoop2 Repository API, the automated testing to validate the new implementation, and some of the trickier implementation details.

http://ingest.tips/2015/05/12/postgresql-repository-added-to-sqoop2/

This post describes several features of the upcoming Apache Slider 0.80-incubating release: Docker-based deployment, zero-package cluster definition, packaging improvements for dependencies/plugins, and improvements to placement strategies. For the placement strategies, there is a description of the improvements as they pertain to long-lived services like Kafka and HBase, YARN labels, placement escalation, and more. There's also a discussion of some features planned for the future.

http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html

The Cloudera blog has a guest post on lessons learned working with Spark. The lessons cover three areas: memory management, data movement, and speed. There are a number of good tips, such as using broadcast variables to do efficient joins between large and small RDDs.

http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
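
The broadcast-join tip can be illustrated without a cluster. The sketch below is plain Python, not Spark code: the function name and the sample data are hypothetical, and a dict stands in for the broadcast variable. In PySpark the small side would typically be shipped with `SparkContext.broadcast` and read through `.value` inside a map function. The point is that the large dataset is never shuffled; each partition looks up keys in a local copy of the small table.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join every partition of a large dataset against an in-memory copy of a small one."""
    # The dict stands in for a broadcast variable: built once, read by every partition.
    lookup = dict(small_table)
    joined = []
    for partition in large_partitions:  # each partition is processed independently
        for key, value in partition:
            if key in lookup:  # inner join: keys missing from the small side are dropped
                joined.append((key, (value, lookup[key])))
    return joined


# Hypothetical data: play events split across two partitions, plus a small user table.
plays = [
    [("u1", "song_a"), ("u2", "song_b")],
    [("u1", "song_c"), ("u3", "song_d")],
]
users = [("u1", "US"), ("u2", "DE")]
result = broadcast_join(plays, users)
```

This only pays off when the small side fits comfortably in each worker's memory; otherwise a regular shuffle join is the safer choice.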

This post on the Hortonworks blog describes YARN's support for scheduling based on virtual core (vcore) resources (in addition to memory). The scheduler calculation becomes trickier with multiple resources, which is why the CapacityScheduler added the DominantResourceCalculator. In addition to detailing how the new calculator works, the post describes what the expected impacts of using the DominantResourceCalculator are and how to configure YARN to use it.

http://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
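
As a rough illustration of the dominant-resource idea, the sketch below computes, for a given usage, which resource takes up the largest fraction of cluster capacity. It is a simplified model in plain Python rather than the Hadoop implementation; the function name, resource keys, and cluster sizes are all assumptions.

```python
def dominant_share(used, capacity):
    """Return (resource_name, share) for the resource with the largest used/capacity ratio."""
    shares = {name: used[name] / capacity[name] for name in capacity}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]


# Hypothetical cluster and two queues with different usage profiles.
cluster = {"memory_mb": 102400, "vcores": 40}
cpu_heavy = {"memory_mb": 20480, "vcores": 16}   # 20% of memory, 40% of vcores
mem_heavy = {"memory_mb": 51200, "vcores": 4}    # 50% of memory, 10% of vcores

cpu_heavy_dominant = dominant_share(cpu_heavy, cluster)
mem_heavy_dominant = dominant_share(mem_heavy, cluster)
```

A memory-only comparison would rank `cpu_heavy` as the lighter consumer (20% versus 50%), while its dominant share is actually 40% because of its vcore usage. That difference is why accounting for more than one resource changes scheduling decisions.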

The latest release of Sqoop2 supports both simple authorization and authorization via Apache Sentry (incubating). The ingest.tips blog has a post describing how to configure Sqoop2 with role-based access controls using the default and Sentry-backed authorization handlers.

http://ingest.tips/2015/05/15/role-based-access-control-in-sqoop2-2/

The Apache blog has two posts describing improvements in the latest release of Apache HBase (more details below). The first post describes two improvements to the Scan API: RPC chunking (which improves handling of larger rows) and scanner heartbeat messages (for when a scanner only infrequently returns rows). The second post describes request throttling, which is a new QoS setting in the 1.1.0 release. After enabling the setting, throttles can be set on the user, table, or namespace level.

https://blogs.apache.org/hbase/entry/scan_improvements_in_hbase_1
https://blogs.apache.org/hbase/entry/the_hbase_request_throttling_feature
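
To show what throttling at the user, table, or namespace level means in practice, here is a small sketch in plain Python. It is not how HBase implements quotas: a fixed-window counter is used here purely for brevity, and the class, method names, and limits are hypothetical. A request is allowed only if it stays within every limit that applies to it.

```python
class RequestThrottle:
    """Fixed-window request limiter keyed by (scope, name), e.g. ("user", "alice")."""

    def __init__(self, window_seconds=1.0):
        self.window_seconds = window_seconds
        self.limits = {}   # (scope, name) -> max requests per window
        self.windows = {}  # (scope, name) -> (window_start, request_count)

    def set_limit(self, scope, name, max_requests):
        """Register a limit for one user, table, or namespace."""
        self.limits[(scope, name)] = max_requests

    def allow(self, now, user, table, namespace):
        """Return True if the request is within every limit that applies to it."""
        candidates = (("user", user), ("table", table), ("namespace", namespace))
        pending = {}
        for key in candidates:
            if key not in self.limits:
                continue  # no throttle configured for this scope
            start, count = self.windows.get(key, (now, 0))
            if now - start >= self.window_seconds:
                start, count = now, 0  # the previous window expired, start a new one
            if count >= self.limits[key]:
                return False  # reject without charging any of the counters
            pending[key] = (start, count + 1)
        self.windows.update(pending)
        return True


throttle = RequestThrottle(window_seconds=1.0)
throttle.set_limit("user", "alice", 2)
decisions = [
    throttle.allow(0.0, "alice", "t1", "ns1"),
    throttle.allow(0.1, "alice", "t1", "ns1"),
    throttle.allow(0.2, "alice", "t1", "ns1"),  # third request in the same window
    throttle.allow(0.2, "bob", "t1", "ns1"),    # no limit applies to bob
    throttle.allow(1.0, "alice", "t1", "ns1"),  # a new window has started
]
```

The timestamp is passed in explicitly so the behavior is deterministic; a real limiter would read a clock instead.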

News

There are several upcoming conferences in the next few months. Hadoop Summit is June 9-11 in San Jose, Spark Summit is June 15-17 in San Francisco (see link below for a promo code), MesosCon is August 20-21 in Seattle, and Flink Forward is October 12-13 in Berlin (call for abstracts is open now).

http://2015.hadoopsummit.org/san-jose/
https://databricks.com/blog/2015/05/11/spark-summit-2015-in-san-francisco-is-just-around-the-corner.html
http://events.linuxfoundation.org/events/mesoscon
http://flink-forward.org/

Apache Geode is a new incubator project derived from the Pivotal GemFire core codebase. Geode is a distributed, in-memory database.

http://www.infoworld.com/article/2908861/hadoop/pivotal-gemfire-open-source-geode.html

SCALE has an interview with Kafka architect and Confluent CEO Jay Kreps. The article covers a lot of topics, including the creation of Kafka at LinkedIn, scaling the data platform at LinkedIn, the role of open-source, and Confluent.

https://medium.com/s-c-a-l-e/from-scaling-linkedin-to-selling-a-nervous-system-for-enterprise-data-f380455a4dd3

Hortonworks reported quarterly earnings this week. Revenue is up 167% year-over-year with a net loss of $0.77/share, both of which beat analyst estimates.

http://www.businessinsider.com/hortonworks-shares-surge-12-after-big-earnings-beat-2015-5

The Altiscale blog has a recap of the Apache Hadoop Global Bug Bash. The event saw contributions from folks in several time zones and resulted in over 100 issues resolved. There are some preliminary plans for another bug bash this fall.

https://www.altiscale.com/hadoop-blog/the-power-of-community-apache-hadoop-global-bug-bash/

Releases

Apache Knox Gateway 0.6.0 was recently released. Among the new features are REST APIs for Storm, caching for LDAP authentication, SSL mutual authentication, and improved support for load balancers. The Hortonworks blog has more on these features.

http://hortonworks.com/blog/announcing-apache-knox-gateway-0-6-0/

Pentaho Labs has announced support for Apache Spark. The integration supports unifying existing Spark jobs with the Pentaho platform and using the Spark SQL engine to power the Pentaho front-end. Pentaho is approaching Spark with a discerning eye, particularly when it comes to multi-tenancy. Datanami has an interview with Pentaho's CTO in which he describes some of their concerns.

http://www.datanami.com/2015/05/12/pentaho-eyes-spark-to-overcome-mapreduce-limitations/

Apache Curator, which is a Java library for Apache ZooKeeper, released version 2.8.0. Curator makes working with ZooKeeper much easier by implementing a number of best practices and common patterns. The release has a number of bug fixes and improvements.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201505.mbox/%3CCABRiMSFbG3Cmf+in=AHYn7uRTEtsgZMQv7DnY2D8-nLMFLwnyg@mail.gmail.com%3E

Apache Sqoop released version 1.4.6 and version 1.99.6 (from the Sqoop2 branch). Both versions include a number of bug fixes and new features (e.g. Parquet support in 1.4.6 and Apache Sentry integration for 1.99.6).

http://mail-archives.us.apache.org/mod_mbox/www-announce/201505.mbox/%3CCAHBV8WdQZR5-cw86zBe_P3qo2EaVPumHPYYhbmWo3rp+zuz0Vw@mail.gmail.com%3E
http://mail-archives.us.apache.org/mod_mbox/www-announce/201505.mbox/%3CCAOvM-chUGGM04jFn9rsK5b=KJA-pRPXuPuiu2KD19Emcm3kWoQ@mail.gmail.com%3E

Cloudera Enterprise 5.4.1 was released. The point release contains fixes for HDFS, YARN, MapReduce, HBase, Hive, and more. There are also improvements to Cloudera Manager and Cloudera Navigator.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-4-1-CDH-5-4-1-Cloudera-Manager/m-p/27546#M65

Apache HBase 1.1.0 was released. This new version has a number of bug fixes and improvements. In addition to the features described in the posts above, the new version has an async RPC client, improved compaction controls, per-column family flush, support for writing the WAL to SSD, and support for using memcached for the HBase block cache.

http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZa%3DGuNoGT%3Dz-4e2_W2rpSuaJ%3DHJTPLCx92xrLhkOyKLUnXSg%40mail.gmail.com%3E

Hermes is a new project providing a message broker API atop Apache Kafka. It provides an HTTP API for clients, a UI to simplify common operations, and Docker images for quickstart.

http://hermes.allegrotech.io/

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Revisiting the MapReduce Paradigm: An R-Specific View (Berkeley) - Tuesday, May 19
http://www.meetup.com/r-enthusiasts/events/222287371/

Spark Streaming and GraphX at Netflix (Los Gatos) - Tuesday, May 19
http://www.meetup.com/spark-users/events/222101339/

Spark Monitoring (Sunnyvale) - Wednesday, May 20
http://www.meetup.com/Pepperdata/events/221659926/

Oregon

Spark 2: Random Forests at Scale (Portland) - Wednesday, May 20
http://www.meetup.com/Portland-Data-Science-Workshops/events/220569344/

Colorado

Intro to Apache Ignite & Semi-Supervised Learning (Denver) - Tuesday, May 19
http://www.meetup.com/Data-Science-Business-Analytics/events/222324800/

Options & Capabilities When Deploying R Analytics on Hadoop (Denver) - Wednesday, May 20
http://www.meetup.com/Boulder-Denver-Big-Data/events/221578716/

Texas

Learn about Cloud Elephants: HaaS (Dallas) - Wednesday, May 20
http://www.meetup.com/Big-Data-in-the-Big-D/events/221446625/

Illinois

ETL Pipelines with Spark (Chicago) - Wednesday, May 20
http://www.meetup.com/Chicago-Spark-Users/events/222121657/

Michigan

Cloudera Product Roadmap and a Special Talk on Spark! (Southfield) - Wednesday, May 20
http://www.meetup.com/greatlakes_cug/events/221712802/

Ohio

Doug Cutting at the CHUG (Mayfield Village) - Monday, May 18
http://www.meetup.com/Cleveland-Hadoop/events/220387634/

Georgia

Hadoop Ecosystem and Spark (Alpharetta) - Tuesday, May 19
http://www.meetup.com/Atlanta-Cloudera-Users-Group/events/221712463/

Virginia

DC Spark Mini-Summit and 1-Year Meetup Celebration (Arlington) - Tuesday, May 19
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/221596490/

New York

NiFi and Kafka (New York) - Tuesday, May 19
http://www.meetup.com/Apache-Kafka-NYC/events/222202035/

Storm: A Big Data Tool for Your Small Data Problems (New York) - Wednesday, May 20
http://www.meetup.com/New-York-City-Storm-User-Group/events/222282603/

URUGUAY

First Meeting: Introduction to Apache Spark (Montevideo) - Tuesday, May 19
http://www.meetup.com/Montevideo-BigData-DataScience-Meetup/events/221843545/

UNITED KINGDOM

Special Event with MapR & Ted Dunning (London) - Wednesday, May 20
http://www.meetup.com/hadoop-users-group-uk/events/222067798/

NETHERLANDS

A Night of Cassandra and Spark at ING (Amsterdam) - Wednesday, May 20
http://www.meetup.com/Netherlands-Cassandra-Users/events/221965626/

ITALY

Marcel Kornacker, Impala Tech Lead (Milano) - Tuesday, May 19
http://www.meetup.com/HUG-Italy/events/222294062/

ISRAEL

First HBase IL Meeting (Tel Aviv-Yafo) - Tuesday, May 19
http://www.meetup.com/HBase-Israel-Meetup/events/222318843/

If you didn't receive this email directly, and you'd like to subscribe to weekly emails, please visit http://hadoopweekly.com