Data Eng Weekly


Hadoop Weekly Issue #144

08 November 2015

After skipping last week, this issue has a lot of content. Notably, there have been a bunch of releases over the past two weeks—Hadoop, Tajo, Phoenix, Slider, Apex, and Storm. In news, Hortonworks announced quarterly results, and there's a new free eBook "Hadoop with Python." Technical content includes tutorials (Apex and Kudu+Impala) and internals (Kafka and Phoenix).

Technical

The DataTorrent blog has a tutorial for writing an Apache Apex application in Scala. The tutorial shows how to set up a Maven project, write a LineReader, Parser, and Application, and run the application with dtcli.

https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/

The Confluent blog has a post describing how Kafka implements "request purgatory," which tracks requests that haven't yet succeeded or encountered an error. The original implementation uses Java's DelayQueue, which is backed by a priority queue and shares its performance characteristics. The new design uses Hierarchical Timing Wheels, which offer faster, tunable performance characteristics. The post describes the implementation in detail and gives an overview of performance benchmarks comparing the old and the new.

http://www.confluent.io/blog/apache-kafka-purgatory-hierarchical-timing-wheels
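For readers new to timing wheels, the idea behind the Kafka post above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Kafka's implementation: the class names (TimingWheel, Timer), the parameters, and the tick-by-tick clock are all invented for the example. Each wheel is a ring of buckets covering a fixed span of time; a task too far in the future goes to a coarser overflow wheel and is moved back down to a finer wheel as its deadline approaches.

```python
class TimingWheel:
    """One level of a hierarchical timing wheel (hypothetical sketch)."""

    def __init__(self, tick_ms, wheel_size, start_ms):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.interval = tick_ms * wheel_size             # span covered by this level
        self.current = start_ms - (start_ms % tick_ms)   # clock, rounded down to a tick
        self.buckets = [[] for _ in range(wheel_size)]
        self.overflow = None                             # coarser wheel, created lazily

    def add(self, expire_ms, task):
        """Store the task; return False if it is already due at this level."""
        if expire_ms < self.current + self.tick_ms:
            return False
        if expire_ms < self.current + self.interval:
            slot = (expire_ms // self.tick_ms) % self.wheel_size
            self.buckets[slot].append((expire_ms, task))  # constant-time insert
            return True
        if self.overflow is None:
            # The coarser wheel's tick equals this wheel's whole interval.
            self.overflow = TimingWheel(self.interval, self.wheel_size, self.current)
        return self.overflow.add(expire_ms, task)

    def advance(self, now_ms):
        """Advance the clock and return entries from buckets that came due."""
        due = []
        while self.current + self.tick_ms <= now_ms:
            self.current += self.tick_ms
            slot = (self.current // self.tick_ms) % self.wheel_size
            due.extend(self.buckets[slot])
            self.buckets[slot] = []
        if self.overflow is not None:
            due.extend(self.overflow.advance(now_ms))
        return due


class Timer:
    """Drives the wheels: due entries either fire or drop to a finer wheel."""

    def __init__(self, tick_ms, wheel_size, start_ms=0):
        self.wheel = TimingWheel(tick_ms, wheel_size, start_ms)

    def schedule(self, expire_ms, task):
        return self.wheel.add(expire_ms, task)

    def advance(self, now_ms):
        fired = []
        for expire_ms, task in self.wheel.advance(now_ms):
            # Re-adding fails when the task is due, so it fires; otherwise
            # it lands in a finer-grained bucket closer to its deadline.
            if not self.wheel.add(expire_ms, task):
                fired.append(task)
        return fired
```

In this sketch tasks fire at tick granularity: a task fires once the clock reaches the start of the tick containing its deadline. The trade-off is that inserts are constant time, while precision and memory are tuned through the tick length and wheel size.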

Hortonworks has a post describing the components and features of Spark that they've worked on in the past year, and where they're concentrating effort for the future. Past work includes ORC support, an Ambari stack definition for Spark, machine learning library improvements, and documentation updates. Future work includes maturing Apache Zeppelin, an entity disambiguation library, a new Spark + HBase integration, the ability to persist RDDs to HDFS's memory tier, and making Spark streaming more robust.

http://hortonworks.com/blog/spark-hdp-perfect-together/

The recently released Apache Phoenix 4.6 includes support for declaring ROW_TIMESTAMP as part of a table's primary key. Doing so stores the value using HBase's native row timestamp, which provides performance gains. In particular, when scanning regions with HFiles that haven't been compacted, the ROW_TIMESTAMP information can be used to skip entire files, which is especially handy when reading recently written data. The introductory blog post describes the optimization in more detail and shows example query response times with and without the feature enabled.

https://blogs.apache.org/phoenix/entry/new_optimization_for_time_series
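The file-skipping idea in the Phoenix item above can be illustrated with a small conceptual model. This is not Phoenix or HBase code: the StoreFile class and the scan function are hypothetical names made up for the sketch, and it assumes only that each file records the minimum and maximum timestamp of the rows it holds. A time-range query can then discard a whole file without reading it when the two ranges do not overlap.

```python
from dataclasses import dataclass, field


@dataclass
class StoreFile:
    """Stand-in for an on-disk file that tracks its own timestamp range."""
    name: str
    min_ts: int
    max_ts: int
    rows: list = field(default_factory=list)  # (timestamp, value) pairs


def scan(files, start_ts, end_ts):
    """Return (rows, files_read) for rows with start_ts <= ts < end_ts."""
    rows, files_read = [], []
    for f in files:
        # Skip the whole file when its range cannot overlap the query range.
        if f.max_ts < start_ts or f.min_ts >= end_ts:
            continue
        files_read.append(f.name)
        rows.extend(r for r in f.rows if start_ts <= r[0] < end_ts)
    return rows, files_read
```

A query over only the most recent timestamps touches only the newest files in this model, which matches the case the post calls out: reading recently written data before compaction has merged the files.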

Kudu, the new storage engine from Cloudera, integrates with Impala for SQL access. This post describes how to set up Impala with Kudu (this currently requires a custom build of Impala), how to tell Impala about data stored in Kudu, how to perform various SQL operations (both read and write/update queries), and more.

http://blog.cloudera.com/blog/2015/11/how-to-use-impala-with-kudu/

This post describes the types of RDD persistence available in Spark. The default is memory-only, which is performant but can lead to OutOfMemoryErrors. The post has a brief overview of the performance characteristics and trade-offs of several other options.

https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-3-rdd-persistence/

This tutorial describes how to use Apache Ambari to install and configure the Tachyon FileSystem, which is a memory-centric distributed storage system. The post also has a brief example of using TachyonFS from Spark.

https://developer.ibm.com/hadoop/blog/2015/11/04/installing-tachyon-0-8-0-on-iop-4-1-4-2/

Depending on data sizes and distributions, an inner join in MapReduce can be performed efficiently in a few different ways. This post describes, at a high level, several of the strategies for implementing an inner join with MapReduce. For each (e.g. reduce-side, map-side), the post describes some of the relevant Hadoop APIs.

https://haifengl.wordpress.com/2015/11/04/inner-join-with-mapreduce/
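As a rough illustration of two of the strategies named in the item above, here is a plain-Python sketch with no Hadoop APIs involved. The function names are hypothetical, and the map-side variant shown is the one that assumes the smaller input fits in memory on every mapper; other map-side approaches exist. The reduce-side version mimics the shuffle by grouping both inputs on the join key before pairing them.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Group both inputs by key (the 'shuffle'), then pair values per key."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)   # records tagged as coming from the left input
    for key, value in right:
        groups[key][1].append(value)   # records tagged as coming from the right input
    # The 'reducer': emit the cross product of both sides for each key.
    return [(key, l, r)
            for key in sorted(groups)
            for l in groups[key][0]
            for r in groups[key][1]]


def map_side_join(big, small):
    """Load the small input into memory and stream the big input past it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # No shuffle is needed: each big-side record is matched where it is read.
    return [(key, value, s)
            for key, value in big
            for s in lookup.get(key, [])]
```

The reduce-side version works for inputs of any size but pays for moving every record through the shuffle. The in-memory map-side version avoids the shuffle entirely, at the cost of requiring one input to be small.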

Myriad is a system for running YARN atop a Mesos cluster. This post looks at how to use Docker's overlay network plugin to isolate YARN clusters (with the ResourceManager and NodeManager running inside Docker). All clusters share a common distributed file system, which can be accessed via another network bridge. The post has many more details, as well as code (including Dockerfiles and scripts), for implementing the solution.

https://www.mapr.com/blog/docker-global-hack-day-on-demand-yarn-clusters

News

Hortonworks announced quarterly results this week. They reported a loss of $0.74/share (adjusted) on $33.1 million in revenue, both of which beat the average analyst estimate (of those surveyed by Zacks Investment Research).

http://www.cnbc.com/2015/11/04/the-associated-press-hortonworks-reports-3q-loss.html

Cask Data, makers of the Cask Data Application Platform for building Apache Hadoop solutions, announced a $20 million Series B round of financing.

http://www.prnewswire.com/news-releases/cask-announces-20-million-series-b-financing-led-by-safeguard-scientifics-300173074.html

The Databricks blog has a recap of last week's Spark Summit EU. The post highlights and links to the slides for several of the talks from the sessions and keynotes.

https://databricks.com/blog/2015/11/06/its-a-wrap-a-lookback-at-spark-summit-in-amsterdam.html

"Hadoop with Python" is a new, free eBook from O'Reilly. It covers the Snakebite Python library, the mrjob MapReduce framework, writing Pig UDFs in Python, PySpark, and the Luigi Python workflow scheduler.

http://www.oreilly.com/programming/free/hadoop-with-python.csp

MapR announced their best-ever quarter of bookings, in which they saw a 160% year-over-year increase in bookings and 200% growth in deal size.

https://www.mapr.com/company/press-releases/mapr-announces-record-bookings

Releases

Apache Phoenix 4.6, the SQL framework for HBase 0.98, 1.0, and 1.1, was released. The new release includes support for HBase native timestamps, a correlation variable, an alpha-version of a web-app for viewing trace information, and more.

http://mail-archives.apache.org/mod_mbox/phoenix-user/201510.mbox/%3CCAMfSBKKHoQdkf73R9gHd0vpc167MG9HJQPwLtSY+bLypCBkBAQ@mail.gmail.com%3E

Apache Tajo, the SQL-on-Hadoop data warehousing system, released version 0.11.0. The new release adds support for nested record types, ORC files, Python UDF/UDAF, tablespaces, and multi-queries. The release also includes improved performance for the JDBC drivers, joins, and more.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces81

Apache Hadoop 2.6.2 was released last week. It includes a number of fixes to YARN and MapReduce, which have been backported from the 2.7 and 2.8 lines.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201510.mbox/%3CCADbBEnt7wwnEidk%2Br_E7-fpz9T2uDKy5dBOf7GJLaDVOm5nGig%40mail.gmail.com%3E

Spark TFOCS (Templates for First-Order Conic Solvers) is a "general purpose optimization package for constructing and solving mathematical objective functions." The introductory post has examples of using TFOCS for solving LASSO linear regression and linear programming problems.

https://databricks.com/blog/2015/11/02/announcing-the-spark-tfocs-optimization-package.html

Version 2.9.1 of Apache Curator, the Java library for Apache ZooKeeper, was released. The version includes several bug fixes and a new recipe for group membership.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201511.mbox/%3CCANykduaFK4YKEdpQ6MtNy-23CdoEy6L=e8DUpRwmAd0NHF=RDA@mail.gmail.com%3E

Apache Slider 0.81.1-incubating was released. Slider is a framework and application for deploying existing distributed systems on YARN. The new release fixes several bugs and contains a few new features/improvements.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201511.mbox/%3CD25D072C.175A1%25jmaron@apache.org%3E

Apache Apex has released its first version, 3.2.0-incubating, since joining the Apache incubator. Apex is a data processing system for streaming and batch, and the new release contains many patches atop the 3.1.0 release.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201511.mbox/%3CCA+5xAo1mS-BMT=Xk_q287_j5m6ngtaT8QEEED0zfQhXtgrnOtA@mail.gmail.com%3E

Apache Storm 0.10.0 has been released. In beta since June, this major new version adds support for secure multi-tenant deployments, Flux (a new framework for defining Storm topologies), an improved logging framework, streaming ingest to Hive, and more.

http://storm.apache.org/2015/11/05/storm0100-released.html

A maintenance release of the previous major version of Storm was also released. Version 0.9.6 resolves 10 issues.

http://storm.apache.org/2015/11/05/storm096-released.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Kafka and the Rise of the Stream Data Platform (San Francisco) - Tuesday, November 10
http://www.meetup.com/SF-Big-Analytics/events/225050211/

Evening with Google Cloud, Distributed DataFrame, and Apache Flink (Mountain View) - Wednesday, November 11
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/225673273/

Open Data Platform Initiative Is Now Open for Business: Here's What It Means (Palo Alto) - Thursday, November 12
http://www.meetup.com/Open-Data-Platform-Group/events/225397821/

Best Practices with Airflow: An Open Source Platform for Workflows & Schedules (San Francisco) - Thursday, November 12
http://www.meetup.com/SV-Data-Engineering/events/225771700/

Deep Dive on Spark Project Tungsten: Largest Performance Optimizations to Date (San Francisco) - Thursday, November 12
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/223666812/

Washington

Spark MLlib: From Integration to Production (Seattle) - Wednesday, November 11
http://www.meetup.com/Seattle-Spark-Meetup/events/220003836/

Colorado

Building a Hadoop Data Application (Denver) - Thursday, November 12
http://www.meetup.com/Denver-Cloudera-User-Group/events/225953593/

Ohio

Spark Smorgasbord (Mason) - Wednesday, November 11
http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/events/226036076/

North Carolina

Conquer Big Data Challenges in Streaming, Security and Data Flow in IoT! (Charlotte) - Tuesday, November 10
http://www.meetup.com/Charlotte-Internet-of-Things/events/225878880/

District of Columbia

Next Generation Accumulo: Iterator Tutorial and Spark (Washington) - Tuesday, November 10
http://www.meetup.com/Accumulo-Users-DC/events/224655060/

New York

Real Time Big Data Processing on AWS (New York) - Tuesday, November 10
http://www.meetup.com/Big-Data-Warehousing/events/225946292/

Rhode Island

Roundtable: AWS Lambda and Kinesis, Experiences and Best Practices (Providence) - Tuesday, November 10
http://www.meetup.com/Providence-Distributed-Systems-Meetup/events/226407677/

UNITED KINGDOM

Managing Data in Mesos: Examining Storage Options + How to Build a Data Pipeline (London) - Wednesday, November 11
http://www.meetup.com/London-Mesos-User-Group/events/226534498/

FRANCE

Cassandra/Kafka & Zeppelin (Paris) - Tuesday, November 10
http://www.meetup.com/Cassandra-Paris-Meetup/events/226216536/

GERMANY

Stream & Batch Processing with Apache Flink and Event-Time Windowing (Munich) - Wednesday, November 11
http://www.meetup.com/Big-Data-Developers-in-Munich/events/226576873/

Big Data Analytics with Cassandra & Spark (Karlsruhe) - Thursday, November 12
http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/225726517/

First Spark-Munich Meetup @ "Big Data Munich" (Munich) - Thursday, November 12
http://www.meetup.com/Spark-Munich/events/226380925/

TURKEY

We're Talking Azure HDInsight! (Istanbul) - Saturday, November 14
http://www.meetup.com/Istanbul-Azure-Meetup/events/226364963/

ISRAEL

Spark & Dataframes for Hundreds of Multi-Tenant Customers & Billions of Events (Tel-Aviv) - Tuesday, November 10
http://www.meetup.com/Big-Data-Israel/events/225943962/

SINGAPORE

November Meetup: Hadoop Security (Singapore) - Friday, November 13
http://www.meetup.com/BigData-Hadoop-SG/events/226576483/