Data Eng Weekly


Hadoop Weekly Issue #152

10 January 2016

This week's issue has a lot of great technical content (including a bunch that I missed from December). Topics covered include performance testing of stream processing systems, new features in Apache Spark 1.6.0, and Apache Ranger. There's lots of great stuff demonstrating that 2016 is going to be an exciting year for the Hadoop ecosystem.

Technical

The Storm team at Yahoo has done a performance comparison of Flink, Storm, and Spark Streaming. The benchmark includes reading/deserializing JSON data, performing a filter and a join (with data from a Redis cluster), and windowing to count events and store them in Redis. For this use case, they measured throughput and latency on all three systems. The post describes some of the key configuration settings and evaluation details. It concludes that there isn't a clear winner but finds that Storm and Flink show sub-second latencies at high throughputs, whereas Spark Streaming shows even higher throughput but at higher latencies.

http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
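To make the shape of that benchmark workload concrete, here is a minimal single-process sketch of the same steps (deserialize JSON, filter, join, count per window). It is not the Yahoo benchmark code: the field names (ad_id, event_type, event_time), the 10-second window, and the in-memory dictionary standing in for the Redis lookup are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical stand-in for the Redis lookup table (ad id -> campaign id).
AD_TO_CAMPAIGN = {"ad-1": "camp-A", "ad-2": "camp-B"}
WINDOW_SECONDS = 10  # assumed window size, not taken from the post


def count_views_per_window(raw_events):
    """Deserialize, filter, join, and count events per (campaign, window)."""
    counts = Counter()
    for line in raw_events:
        event = json.loads(line)                       # deserialize JSON
        if event["event_type"] != "view":              # filter
            continue
        campaign = AD_TO_CAMPAIGN.get(event["ad_id"])  # join against lookup
        if campaign is None:
            continue
        # Tumbling window: round the timestamp down to the window start.
        window_start = event["event_time"] // WINDOW_SECONDS * WINDOW_SECONDS
        counts[(campaign, window_start)] += 1
    return dict(counts)


events = [
    json.dumps({"ad_id": "ad-1", "event_type": "view", "event_time": 3}),
    json.dumps({"ad_id": "ad-1", "event_type": "click", "event_time": 4}),
    json.dumps({"ad_id": "ad-2", "event_type": "view", "event_time": 12}),
    json.dumps({"ad_id": "ad-1", "event_type": "view", "event_time": 9}),
]
result = count_views_per_window(events)
print(result)
```

In a real deployment each of these steps would be a distributed operator, and the per-window counts would be written back to an external store rather than returned from a function.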

This post shows how to use Apache Drill to analyze astronomical data about the International Space Station and the Sun in order to identify the date of a picture of ISS solar transits. Drill supports some pretty complicated vector math in order to compute the relevant data points.

http://www.dremio.com/blog/predicting-international-space-station-solar-transits-using-built-in-sql-math-functions/

This post describes a few different mechanisms for setting up a Hadoop cluster on an Ubuntu server. It covers an install from Debian packages (built by the Apache Bigtop project and distributed through the hadoop-ppa), a Bigtop build that runs as a Docker container, and a dev setup using mrjob (a Python library) with Elastic MapReduce. For each, there are details of custom configs and instructions for running a simple MapReduce job to compute the value of pi.

http://tech.marksblogg.com/hadoop-up-and-running.html
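For readers who have not seen a pi job before, the idea can be sketched in plain Python as a map step and a reduce step. This is an illustrative sketch using simple Monte Carlo sampling, chosen for brevity; it does not use Hadoop or mrjob and is not necessarily the method used by the job in the linked post. The function names, seeds, and sample counts are hypothetical.

```python
import random
from functools import reduce


def map_task(seed, samples):
    """One 'mapper': count random points in the unit square that fall
    inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # per-task seed keeps the run repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_counts(a, b):
    """The 'reducer': sum the (inside, total) pairs from the mappers."""
    return a[0] + b[0], a[1] + b[1]


# Four hypothetical map tasks, each with its own seed.
partials = [map_task(seed, 20000) for seed in range(4)]
inside, total = reduce(reduce_counts, partials)

# The fraction of points inside the quarter circle approximates pi / 4.
estimate = 4.0 * inside / total
print(estimate)
```

Because each map task only returns a pair of counts, the work splits cleanly across machines, which is why this kind of job is a common smoke test for a new cluster.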

Spark 1.6.0 (released this week, more below) adds Spark Datasets, a new type-safe API built atop the DataFrame API. An introductory post shows some examples and quantifies (with some example benchmarks) how it improves memory usage and execution time. The API is available from Java and Scala.

https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html

The Cloudera blog highlights some improvements to how Apache Impala (incubating) handles Parquet data. Specifically, some pitfalls related to how Parquet and HDFS independently tune block size are now handled more smoothly.

http://blog.cloudera.com/blog/2016/01/new-in-cdh-5-5-apache-parquet-usability-improvements/

The IBM developer blog has posted some preliminary benchmarks of the recently released Apache Spark 1.6.0. The release contains a number of changes, including performance optimizations, which impact workflows in different ways. They compared performance of JSON processing, MLlib's K-Means, and SparkSQL queries across 3 (or 4) recent versions of Spark.

https://developer.ibm.com/hadoop/blog/2016/01/05/spark-1-6-0-performance-sneak-peek/

This post describes how Apache Ranger integrates with Apache Hadoop HDFS to secure access. Ranger provides centralized security policy management that works in conjunction with HDFS' built-in controls. The post includes some examples of configuring Ranger policies for a directory in HDFS.

http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger/

The big data analytics team at Cigna has built a stream-processing application that consumes data from Kafka, processes the data via Spark Streaming, and makes the data query-able via a RESTful HTTP API. The RESTful API pulls data from Impala using the Impyla Python API. The post describes a number of performance enhancements—configuration changes and improvements to caching and partitioning. These tunings and learnings should be really useful for anyone working with Spark Streaming and Kafka.

http://blog.cloudera.com/blog/2016/01/how-cigna-tuned-its-spark-streaming-app-for-real-time-processing-with-apache-kafka/

Hortonworks has posted a list of their most popular blog posts from 2015. These are mostly technical, covering topics like Hive, Storm, Spark, and releases of HDP.

http://hortonworks.com/blog/top-ten-blogs-from-2015/

This tutorial shows how to set up Elastic MapReduce with a separate instance of Apache Zeppelin for submitting jobs. This has the advantage of supporting multiple (or zero) clusters without needing to make major changes to the Zeppelin instance.

http://blogs.aws.amazon.com/bigdata/post/Tx2HJD3Z74J2U8U/Running-an-External-Zeppelin-Instance-using-S3-Backed-Notebooks-with-Spark-on-Am

The Cloudera blog has a post showing how to hook up the Ibis Python library to Kudu to interact with data stored there. The article describes Kudu, demonstrates the Kudu Python library, and shows how to use Ibis with Kudu tables.

http://blog.cloudera.com/blog/2016/01/interactive-analytics-on-dynamic-big-data-in-python-using-kudu-impala-and-ibis/

The Confluent "Log Compaction" blog has a bunch of highlights of recent developments in the Kafka community. There are lots of links and quick details about exciting new features (e.g. Kafka Connect and removing the ZooKeeper dependency for clients) and use cases (Kafka at Microsoft, Kafka with Spring).

http://www.confluent.io/blog/log-compaction-highlights-in-the-kafka-and-stream-processing-community-january-2016

News

The Databricks blog has an article reviewing the progress of Spark over the past year. It covers the community evolution and adoption, new data science/platform/streaming APIs, and performance optimization work.

https://databricks.com/blog/2016/01/05/spark-2015-year-in-review.html

This article from the MapR blog has six predictions for big data for the next year. They include increased interest in streaming data, shorter time to value, centralization, and rapid adoption of Hadoop for healthcare and telecommunications.

https://www.mapr.com/blog/what-will-you-do-2016-apache-spark-kafka-drill-and-more

Releases

Apache Spark 1.6.0 was released this week. The release includes many changes across several components. Among the major ones are a new Dataset API, unified memory management, improved Parquet performance, improved state management for Spark Streaming, and several new algorithms for MLlib (online hypothesis testing, bisecting k-means clustering, and more).

http://spark.apache.org/releases/spark-release-1-6-0.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

HadoopSF January 2016 Meetup (San Francisco) - Tuesday, January 12
http://www.meetup.com/hadoopsf/events/225350988/

Washington

Jump Start into Apache Spark (Seattle) - Tuesday, January 12
http://www.meetup.com/Seattle-Spark-Meetup/events/221571232/

Colorado

Analytics with Spark and Cassandra (Denver) - Tuesday, January 12
http://www.meetup.com/Colorado-Cassandra-Meetup/events/227480011/

Illinois

Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Chicago) - Wednesday, January 13
http://www.meetup.com/Big-Data-Developers-in-Chicago/events/226883184/

South Carolina

Data Analytics Infrastructure (Charleston) - Tuesday, January 12
http://www.meetup.com/Charleston-Data-Analytics/events/227254233/

New York

Querying Network Packet Captures with Spark and Drill (New York) - Wednesday, January 13
http://www.meetup.com/New-York-Apache-Drill-Meetup/events/227168254/

First Meetup - Reactive Monitoring and Distributed Streaming (New York) - Thursday, January 14
http://www.meetup.com/Reactive-New-York/events/227703915/

Massachusetts

Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Boston) - Tuesday, January 12
http://www.meetup.com/Open-Source-Analytics-Boston/events/226759308/

Open Analytics Boston: Short Talks & Demos (Boston) - Thursday, January 14
http://www.meetup.com/Open-Analytics-Boston/events/227015357/

CANADA

Spark Basics (Montreal) - Wednesday, January 13
http://www.meetup.com/Montreal-Apache-Spark-Meetup/events/227554437/

IRELAND

Elastic Big Data Processing with Myriad and Mesos. ETL Use Cases and Hadoop (Dublin) - Monday, January 11
http://www.meetup.com/hadoop-user-group-ireland/events/227456614/

GERMANY

Apache Spark, Scala, Reactive Technologies and Machine Learning Discussions (Berlin) - Tuesday, January 12
http://www.meetup.com/Big-Data-Developers-in-Berlin/events/227744512/

POLAND

Dive into Hadoop (HDInsight): Common Big Data Analysis Scenarios on Microsoft Azure (Krakow) - Wednesday, January 13
http://www.meetup.com/PLSSUG/events/226587245/

HUNGARY

Big Data Meetup - January 2016 (Budapest) - Monday, January 11
http://www.meetup.com/Big-Data-Meetup-Budapest/events/227250724/

TAIWAN

Spark Installation & MLLib - Wednesday, January 13
http://www.meetup.com/Apache-Spark-Hsinchu/events/227734886/