Data Eng Weekly


Hadoop Weekly Issue #169

08 May 2016

This week's issue is short and sweet. Topics covered include Apache Beam, MapR's quarterly results, the recent Kafka Summit, and a new open-source distributed unit test framework from Cloudera.

Technical

Elastic has written a root cause analysis of recent outages. A misconfigured ZooKeeper memory setting caused excess garbage collection, which ultimately led to loss of the ZooKeeper quorum. The post describes a number of mitigation strategies they've implemented to prevent a similar problem in the future.

https://www.elastic.co/blog/elastic-cloud-outage-april-2016

The Cask blog has a recap of the recent Big Data Applications Meetup. The first of the talks was about Pachyderm, which is based on Docker containers and provides "Git for your data" semantics. The second was about the big data platform at TubeMogul, which is built on Hadoop, Hive, Spark, and Presto.

http://blog.cask.co/2016/05/pachyderm-and-tubemogul-share-their-big-data-application-platforms-and-experience/

Google and dataArtisans have both written about Apache Beam (formerly the Google Dataflow SDK). The Google post explains their motivation for open-sourcing and developing Beam, and the dataArtisans post talks about their support for the Beam model and how one should think about the relationship between the Flink and Beam APIs.

https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
http://data-artisans.com/why-apache-beam/

The IBM Hadoop dev blog has a run book for installing the Python, Scala, and R kernels for Jupyter notebooks. The post also describes how to connect to Spark and expose the notebook over SSL.

https://developer.ibm.com/hadoop/blog/2016/05/04/install-jupyter-notebook-spark/

This post describes how the Mongo Hadoop connector functions as a go-between for Spark and MongoDB.

https://x.ai/using-the-mongo-hadoop-connector-as-a-translation-layer-to-spark/

The Qubole blog has a post comparing the newest of the programming languages used for big data analysis—Python, R, and Scala.

http://www.qubole.com/blog/big-data/programming-language/

News

MapR announced that they had a record quarter with 99% growth in subscription licenses and a 146% dollar-based net expansion rate.

https://www.mapr.com/company/press-releases/mapr-achieves-another-record-quarter-99-software-subscription-license-growth

This article describes a recent benchmark comparing Google Cloud Dataflow and Apache Spark on Google Compute Engine. Dataflow outperformed Spark by 2x to 5.7x (as always, it's best to evaluate your own workload rather than trusting benchmarks). The post also describes a "cold war" that is benefiting everyone using big data tools.

http://www.datanami.com/2016/05/02/dataflow-tops-spark-benchmark-test/

The Confluent blog has a recap from the recent Kafka Summit covering the pre-conference hackathon, keynotes, breakout sessions, and more.

http://www.confluent.io/blog/log-compaction-kafka-summit-edition-may-2016

Forbes has an overview of American Express' journey over the past five years to adopt big data technologies. In the article, AMEX shares some tips and lessons learned, such as the difficulty of adopting new technologies (and how important buy-in from the top of the organization is), the challenge of hiring and retaining engineers, and more.

http://www.forbes.com/sites/ciocentral/2016/04/27/inside-american-express-big-data-journey/

Releases

Cask has announced version 3.4 of the Cask Data Application Platform (CDAP). The new release adds Cask Tracker, a new data lineage/audit/search system, updates the UI for Cask Hydrator, enhances Spark support, and more.

http://blog.cask.co/2016/05/announcing-cdap-release-3-4-introducing-tracker-next-gen-hydrator-enhanced-spark-support-and-much-more/

Cloudera has open-sourced dist_test, a new tool for running unit tests in parallel. With this tool, the unit tests for projects like Hadoop and Kudu run in minutes instead of hours. The tool has bindings for both C++ and Java, and there's a website demoing its features.

http://blog.cloudera.com/blog/2016/05/quality-assurance-at-cloudera-distributed-unit-testing/

Google has announced a new integration between Google BigQuery and Drive to support saving output to Google Sheets.

http://techcrunch.com/2016/05/06/google-connects-bigquery-to-google-drive-and-sheets/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

GE IoT Predix Time Series & Data Ingestion Service Using Apache Apex (San Jose) - Tuesday, May 10
http://www.meetup.com/Big-Data-native-Hadoop-Ingest-and-Transform-Bay-Area/events/230507862/

Washington

Apache Spark Workshop and Combining ML Frameworks with Apache Spark (Bellevue) - Thursday, May 12
http://www.meetup.com/Seattle-Spark-Meetup/events/226391630/

Illinois

Introduction to Big Data Analytics Using Apache Spark and Apache Zeppelin (Chicago) - Thursday, May 12
http://www.meetup.com/futureofdata-chicago/events/230027634/

Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, May 9
http://www.meetup.com/Cleveland-Hadoop/events/229283173/

Virginia

Apache Ranger: Securing Big Data in Hadoop (Reston) - Wednesday, May 11
http://www.meetup.com/DataDC/events/230633306/

New Jersey

Apache NiFi: Deep Dive - Ingestion Technology (Hamilton) - Tuesday, May 10
http://www.meetup.com/nj-hadoop/events/229611450/

New York

Apache Storm 1.0 with Taylor Goetz (New York) - Wednesday, May 11
http://www.meetup.com/New-York-City-Storm-User-Group/events/230651433/

Spark for Reactive Machine Learning: Building Intelligent Agents at Scale (New York) - Wednesday, May 11
http://www.meetup.com/Open-Source-Analytics-New-York/events/230755099/

CANADA

Spark with C* + Testing/Modelling in Ruby (Toronto) - Tuesday, May 10
http://www.meetup.com/Toronto-Cassandra-Users-Group/events/230766348/

Vancouver Spark Meetup: ApacheCon Extravaganza (Vancouver) - Tuesday, May 10
http://www.meetup.com/Vancouver-Spark/events/229692936/

IRELAND

Scaling Up Genomics with Spark + Understanding Your Customers Using Public Data (Dublin) - Monday, May 9
http://www.meetup.com/hadoop-user-group-ireland/events/230464912/

UNITED KINGDOM

Spark Streaming Double Bill (London) - Thursday, May 12
http://www.meetup.com/Spark-London/events/230836039/

NORWAY

Use of Hadoop for Large Scale Machine Learning at Yahoo (Trondheim) - Wednesday, May 11
http://www.meetup.com/Trondheim-Big-Data/events/230834974/

BELGIUM

Cassandra Introduction & Dashboarding with Spark/Cassandra (Kontich) - Monday, May 9
http://www.meetup.com/Brussels-Cassandra-Users/events/230631627/

GERMANY

Big Data, Frankfurt v 2.0 (Frankfurt) - Thursday, May 12
http://www.meetup.com/Big-Data-Frankfurt/events/227415048/

INDIA

Second Spark Meetup (Pune) - Thursday, May 12
http://www.meetup.com/Pune-Apache-Spark-Meetup/events/230517940/

High-Speed Connectors for Spark (Bangalore) - Saturday, May 14
http://www.meetup.com/Bangalore-Spark-Enthusiasts/events/230864866/

AUSTRALIA

Fault Tolerant Streaming + Spark & Cassandra + Operationalise Machine Learning (Sydney) - Tuesday, May 10
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/229953293/