Data Eng Weekly


Hadoop Weekly Issue #201

22 January 2017

This week is a short but very sweet issue, with fantastic articles on Apache Kafka, Apache Spark, Apache Airflow (incubating) and more. Also, Twitter has written about the scale of their infrastructure, and there's a great post describing building materialized views with Kafka. On the release front, Apache HBase and Apache Kudu both shipped new versions this week.

Technical

The Morning Paper is covering several papers from this year's Conference on Innovative Data Systems Research (CIDR). These include "SnappyData: A unified cluster for streaming, transactions, and interactive analytics" and "Dependency-driven analytics: a compass for uncharted data oceans." Rather than linking to each of the week's posts, here's the post that introduces the coverage for the week.

https://blog.acolyer.org/2017/01/15/innovation-experience-based-insight-and-vision-at-cidr-17/

Twitter has written about the challenges and lessons learned in scaling its infrastructure, of which 19.6% is Hadoop (and close to 50% is made up of data systems). The post covers network traffic, storage (Twitter stores 500PB of data across multiple Hadoop clusters, the largest of which is 10k nodes), Puppet at scale, and more.

https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale

The MapR blog has a post about using Apache Kafka, Apache Spark, and Apache Ignite for a streaming application that writes data out to Apache HBase. With five or so tuning changes (ranging from tweaking JVM settings to fixing timeouts), the Spark Streaming job became 12x faster. The post also covers some details of how the system was stabilized (such as running Spark in standalone mode rather than via Mesos).

https://www.mapr.com/blog/performance-tuning-apache-kafkaspark-streaming-system

This tutorial walks through using Spark's Structured Streaming to load CloudTrail audit logs into a data warehouse built on S3 and Apache Parquet. While this is mostly a getting-started tutorial, it also includes a discussion of how to make the pipeline more production-ready by setting up fault tolerance using a checkpoint location.

https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html

There have recently been security incidents involving Hadoop clusters that are exposed on the internet. Cloudera has put together a guide (aimed at Cloudera users, but in parts generally applicable) with basic steps for locking down a Hadoop cluster.

http://blog.cloudera.com/blog/2017/01/how-to-secure-internet-exposed-apache-hadoop/

The Google Cloud Platform Medium publication has a post on using Apache Airflow (incubating) with BigQuery. It highlights some of the useful features of Airflow, such as support for Jinja template substitution when building SQL queries.

https://medium.com/google-cloud/airflow-for-google-cloud-part-1-d7da9a048aa4

This article illustrates common problems related to performance and caching of data systems, gives an overview of some common solutions, and presents a high-performance solution based on materialized views. From there, the post contains code snippets and describes how to use Kafka Streams and a local cache to compute and serve requests. The post also describes how to integrate with other systems like Redshift and Elasticsearch.

http://theza.ch/2017/01/16/updating-materialized-views-and-caches-using-kafka/
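
For readers new to the idea, here is a minimal sketch of a materialized view in plain Python. It is not taken from the linked post and does not use Kafka Streams: the event type, the view class, and the helper function are all hypothetical, and an in-memory list stands in for a Kafka topic. The point is only that the view is updated incrementally as each change event arrives, so reads become cheap lookups against a local cache rather than queries against the source system.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class OrderEvent(NamedTuple):
    """Hypothetical change event, standing in for a record read from a topic."""
    customer_id: str
    amount_cents: int


class SpendByCustomerView:
    """Hypothetical materialized view: total spend per customer, kept in memory."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)

    def apply(self, event: OrderEvent) -> None:
        # Incremental update: fold one event into the precomputed aggregate.
        self._totals[event.customer_id] += event.amount_cents

    def get(self, customer_id: str) -> int:
        # Serving a request is a local lookup; unknown customers read as 0.
        return self._totals.get(customer_id, 0)


def build_view(events: Iterable[OrderEvent]) -> SpendByCustomerView:
    """Replay a stream of events (here, any iterable) into a fresh view."""
    view = SpendByCustomerView()
    for event in events:
        view.apply(event)
    return view


if __name__ == "__main__":
    stream = [
        OrderEvent("alice", 500),
        OrderEvent("bob", 250),
        OrderEvent("alice", 125),
    ]
    view = build_view(stream)
    print(view.get("alice"), view.get("bob"))
```

Because the view can be rebuilt by replaying the events from the start, the same pattern extends to feeding other downstream stores, which is the kind of integration the post discusses.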

News

JanusGraph is a new effort to build a scalable graph database based on the Titan project. Interestingly, it's being run at the Linux Foundation rather than the Apache Software Foundation.

https://www.linux.com/blog/Linux-Foundation-welcomes-JanusGraph

The Apache blog has a post about Apache Ignite, which is an in-memory data fabric. Ignite supports many use cases such as transactional updates and SQL queries, and it is integrated with Spark, Hadoop, YARN, and more.

https://blogs.apache.org/foundation/entry/the-asf-asks-have-you

As mentioned above in reference to the Cloudera post on securing Hadoop, there have been incidents in which publicly addressable Hadoop installs have had data deleted. This post provides more details on what's been happening and how to secure your setup.

http://www.threatgeek.com/2017/01/open-hadoop-installs-wiped-worldwide.html

Releases

Version 2.5.0 of the workflow system, Luigi, was released. There are a number of changes in the release, most notably improvements to the BigQuery support.

https://github.com/spotify/luigi/releases/tag/2.5.0

Apache HBase 1.3 was released this week, with over 1700 resolved issues. There are several improvements, including date-based tiered compactions, improvements to the metrics system, and client optimizations for looking up region locations.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201701.mbox/%3CCAHxLZBWn6eLPTjLG7NxpVNQzf-M1T984N90W9bswSUVDk5vYPA@mail.gmail.com%3E

Apache NiFi released versions 1.0.1 and 1.1.1 in December. If you haven't upgraded, there is some more urgency now that an XSS vulnerability has been disclosed.

http://mail-archives.apache.org/mod_mbox/www-announce/201701.mbox/%3CCALJK9a4TNPvGav_UxwLQvqY0M2mRNWnvQBvu58p7%3D_ZfD1_AGg%40mail.gmail.com%3E

Apache Kudu 1.2.0 was released. This release improves the implementation of strong consistency guarantees, fixes a corruption bug with ext4 on RHEL 6, and more.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201701.mbox/%3CCADY20s7qghYLN96fAU96FokFCvk%2B_9t%2BKFFYv_aBw_PddYN-Og%40mail.gmail.com%3E
http://kudu.apache.org/releases/1.2.0/docs/release_notes.html

At VLDB 2015, Facebook published a paper on their time series database, Gorilla. Recently, they open-sourced an implementation called Beringei. It's written in C++ and there's a Dockerfile to get started.

https://github.com/facebookincubator/beringei/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Where the Worlds of Data Eng & Data Science Merge! (Santa Clara) - Thursday, January 26
https://www.meetup.com/BigDataCloud/events/236887540/

VEGAS: The Missing Matplotlib for Spark, Presented by Netflix (San Francisco) - Thursday, January 26
https://www.meetup.com/SF-Big-Analytics/events/235425610/

Washington

Real-Time Data Ingestion & Streaming: Talks from Avvo, Expedia and Confluent (Seattle) - Wednesday, January 25
https://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/236855696/

Seattle Scalability Meetup: Evolution of Machine Learning Sys w/ Stripe Radar (Seattle) - Wednesday, January 25
https://www.meetup.com/Seattle-Scalability-Meetup/events/234882892/

Georgia

Jumpstart Your Big Data Analytics Journey with the Hortonworks Sandbox and Hive (Atlanta) - Thursday, January 26
https://www.meetup.com/futureofdata-atlanta/events/236647565/

North Carolina

January CHUG: What's the Big Deal with Hadoop? The Elephant in the Room (Charlotte) - Thursday, January 26
https://www.meetup.com/CharlotteHUG/events/227293791/

Virginia

Big Data Tools in Azure (McLean) - Monday, January 23
https://www.meetup.com/NOVASQL/events/234673703/

Pennsylvania

Big Data Governance and Security in Apache Hadoop: Healthcare Client Use Case (Philadelphia) - Thursday, January 26
https://www.meetup.com/futureofdata-philadelphia/events/235465127/

SPAIN

Seminar: Fundamentals of Apache Spark (Madrid) - Friday, January 27
https://www.meetup.com/big-data-open-school/events/236828465/

NETHERLANDS

Kafka Connect & Repeatable Deployment of Kafka Streams Topologies on Kubernetes (Utrecht) - Thursday, January 26
https://www.meetup.com/Kafka-Meetup-Utrecht/events/236692198/

GERMANY

IoT Tech Meetup #3: Streaming Analytics (Berlin) - Tuesday, January 24
https://www.meetup.com/IoT-Innovation-Lab/events/236791423/

Apache Kafka Meetup with Jay Kreps and Michael Noll (Munich) - Wednesday, January 25
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/236402498/

What Is New in Hadoop 3.0 (Dusseldorf) - Wednesday, January 25
https://www.meetup.com/Big-Data-Hadoop-Spark-NRW/events/236723039/

SWITZERLAND

19th Swiss Big Data User Group Meeting (Zurich) - Monday, January 23
https://www.meetup.com/swiss-big-data/events/236020110/

POLAND

Facebook Presto: SQL-on-Anything (Warsaw) - Tuesday, January 24
https://www.meetup.com/warsaw-hug/events/236467094/

SRI LANKA

Processing Big Data Using Apache Hive & Microsoft Azure Machine Learning Studio (Colombo) - Tuesday, January 24
https://www.meetup.com/LKBigData/events/236812354/