Data Eng Weekly


Hadoop Weekly Issue #154

24 January 2016

Often there isn't a clear theme to a week, but stream processing is the hot topic this issue. Google has submitted the Dataflow SDK to the Apache Incubator, there's a great article on streaming data processing from O'Reilly, and there are several articles about Apache Kafka. In addition, there is some fundraising news for two Hadoop ecosystem companies, there are several releases, and there is a mix of other content.

Technical

Datanami has a thorough comparison of SQL-on-Hadoop engines (both vendor-backed and open-source). The post has a useful bucketing of engines into batch-oriented, interactive, and in-memory as well as a discussion of other important considerations (such as supported file formats). It also notes that we'll likely see some consolidation in the near future, which is important to keep in mind as one evaluates tools.

http://www.datanami.com/2016/01/13/picking-the-right-sql-on-hadoop-tool-for-the-job/

acmqueue has a great article about immutability in computing. The decreasing cost of storage has enabled systems built on immutable/append-only components such as GFS/HDFS (which are discussed in this post) and Kafka. In addition to these, the article explores several other types of systems (e.g. relational databases, distributed systems), hardware (SSDs), and system patterns (copy-on-write, replication in distributed systems, fault tolerance) that make use of or provide immutable semantics.

http://queue.acm.org/detail.cfm?id=2884038

O'Reilly has a long, in-depth article about streaming data processing. It's a follow-up to the recent "Streaming 101" post, and it covers topics like event time vs. processing time, windowing, watermarks, triggers, and accumulation. The article is full of figures and animations describing these core concepts that make up the what, where, when, and how of data processing.

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
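
To make the event-time vs. processing-time distinction concrete, here is a minimal sketch in plain Python. It is not taken from the article: the function name, the sample events, and the 60-second window size are all assumptions made for illustration. It assigns each event to a fixed (tumbling) window based on when the event happened rather than when it arrived, so out-of-order arrivals still land in the right window. Watermarks, triggers, and accumulation are deliberately left out.

```python
from collections import defaultdict


def tumbling_event_time_windows(events, window_size):
    """Group (event_time, value) pairs into fixed windows keyed by window start.

    The window is chosen from the event's own timestamp, so the order in
    which events arrive does not change which window they belong to.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)


# Hypothetical events, listed in arrival (processing-time) order.
# The first field is the event time in seconds; "c" and "e" arrive late.
arrivals = [(12, "a"), (61, "b"), (58, "c"), (125, "d"), (119, "e")]

windows = tumbling_event_time_windows(arrivals, window_size=60)
```

Here "c" arrives after "b" but still joins the window starting at 0, because its event time (58) falls in that window. A real streaming system also has to decide when a window's result is complete enough to emit, which is where watermarks and triggers come in.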

The Databricks blog has a post on the new features of MLlib in Apache Spark 1.6. The post describes (and links to relevant examples of) a few of these: pipeline persistence, new ML algorithms, and improved MLlib integration for SparkR.

https://databricks.com/blog/2016/01/21/mllib-highlights-in-spark-1-6.html

This presentation describes how Rocana is building a search system for large-scale (10s of TB/day) data atop Kafka and HDFS. The slides present the reasons for building a custom search solution, the architecture of the system, how events are collected/partitioned, the write path through Kafka to HDFS, and the basics of the query system (which takes advantage of things like HDFS short-circuit reads).

http://www.slideshare.net/esammer/high-cardinality-time-series-search-a-new-level-of-scale-data-day-texas-2016

On the heels of the new producer API in Kafka 0.8.1, version 0.9.0 introduced a new consumer API. The new API removes the distinction between a simple and a high-level client, removes the dependencies on the Scala runtime and ZooKeeper, and adds security extensions, among other changes. The post describes how to get started with the new client via code snippets, demonstrates an example polling client, and discusses delivery semantics (which are related to offset management).

http://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0.9-consumer-client
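
The link between delivery semantics and offset management can be sketched without any Kafka code. The following is a pure-Python simulation, not the Kafka consumer API: the `simulate` function, its parameters, and the crash model are hypothetical and exist only for illustration. It assumes a consumer that crashes once, between its two steps (commit and process), and then restarts from the last committed offset.

```python
def simulate(records, commit_first, crash_at):
    """Simulate a consumer that crashes once and restarts from its committed offset.

    commit_first=True  commits an offset before processing the record.
    commit_first=False processes the record before committing its offset.
    The crash happens between those two steps for the record at crash_at.
    """
    committed = 0   # offset the consumer resumes from after a restart
    processed = []  # records the application actually handled
    for attempt in ("first run", "after restart"):
        for offset in range(committed, len(records)):
            if commit_first:
                committed = offset + 1
            else:
                processed.append(records[offset])
            if attempt == "first run" and offset == crash_at:
                break  # simulated crash between the two steps
            if commit_first:
                processed.append(records[offset])
            else:
                committed = offset + 1
    return processed


records = ["r0", "r1", "r2"]
at_most_once = simulate(records, commit_first=True, crash_at=1)
at_least_once = simulate(records, commit_first=False, crash_at=1)
```

Committing first loses "r1" (at-most-once), while processing first handles "r1" twice (at-least-once). With a real client, the same trade-off depends on where the offset commit sits relative to the processing code.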

This post explores a gotcha related to the old Kafka Producer API's default support for byte arrays. It's a clear description of a rather subtle issue, and it provides good context on some of the Kafka Producer API internals.

http://www.agardner.me/kafka/big/data/partitioner/java/scala/byte/array/2016/01/23/kafka-partitioning.html

News

Hadoop Summit Europe is still a couple of months away, but the Hortonworks blog has previews of two of the community choice winners. The first is about Apache Flink at Capital One, and the second discusses machine learning with big data.

http://hortonworks.com/blog/overview-of-apache-flink-the-4g-of-big-data-analytics-frameworks/
http://hortonworks.com/blog/community-choice-winner-blog-machine-learning-big-data-look-forward-left-behind/

Google, along with developers from a number of other companies, has proposed incubating the Google Dataflow SDK at the Apache Incubator. The SDK provides a high-level API for batch and stream processing with a pluggable backend (Spark, Flink, a single-node local runner, and the Google-hosted Cloud Dataflow are all supported).

http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and-open-source-proposal-to-join-the-Apache-Incubator.html

Datanami has a summary of key points from the recent Forrester report on Hadoop distributions. It mentions the distribution leaders (Cloudera, Hortonworks, IBM, and MapR), some of the differentiators among distros, market presence, and more.

http://www.datanami.com/2016/01/20/hadoop-market-is-neck-and-neck-forrester-says/

Hortonworks announced this week that they're seeking an additional $100 million in funding as part of a secondary share offering. Hortonworks stock was down after the announcement but made back some ground towards the end of the week.

http://siliconangle.com/blog/2016/01/20/hapless-hortonworks-shares-plunge-22-on-news-of-secondary-ipo/

Qubole, makers of the Qubole Data Service, announced that they've secured $30 million in Series C financing. In the post, Qubole notes that customers are processing over 250 petabytes each month using their platform across Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

https://www.qubole.com/blog/big-data/series-c/

Releases

Version 0.1.0 of kudu-python was recently released. This is a Python API to Apache Kudu (incubating) that uses the C++ Client API.

https://pypi.python.org/pypi/kudu-python

Apache Apex has announced version 3.3.0-incubating of the Malhar library. Malhar is a library of operators and adapters for real-time streaming applications. The new release contains a number of bug fixes, improvements, and new features such as support for anti and semi joins and support for Kafka 0.9.0's new consumer API.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201601.mbox/%3CCA+5xAo3XkOc5ABAp3nTvCwJgFyy1rU8rDjNxuJepXv_0b9iJOw@mail.gmail.com%3E

Cloudera has announced version 2.0 of Cloudera Director, their tool for managing CDH clusters in the cloud. The new release adds support for spot instances, high availability, Kerberos configuration, automatic job submission, RHEL 7.1, and more. The Cloudera blog has many more details on Cloudera Director.

http://blog.cloudera.com/blog/2016/01/whats-new-in-cloudera-director-2-0/

Spark-TS 0.2.0 is the second version of the Spark time series library from Cloudera. The new release switches to java.time in order to support nanosecond precision, offers a more developed Java API, and more.

http://blog.cloudera.com/blog/2016/01/spark-ts-0-2-0-released/

Version 3.3.0 of the Cask Data Application Platform was released. Major features of the new release include improvements to CDAP metadata and the Cask Hydrator.

http://blog.cask.co/2016/01/cdap-3-3-0-is-out-check-out-whats-new/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Big Data Application Meetup (Palo Alto) - Wednesday, January 27
http://www.meetup.com/BigDataApps/events/227191025/

Data Ingest at Scale: Lessons from PlanetLabs and Uber (Mountain View) - Wednesday, January 27
http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/227924382/

Building and Scaling Data Pipelines (San Francisco) - Wednesday, January 27
http://www.meetup.com/Keen-IO/events/228068723/

Big Data for the Enterprise, Part 1 (San Francisco) - Wednesday, January 27
http://www.meetup.com/Big-Data-for-Business-Users/events/228036200/

Evening with Martin Odersky! + Spark Approximations + Twitter Algebird (San Francisco) - Thursday, January 28
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/226122226/

Washington

Seattle Scalability Meetup (Seattle) - Wednesday, January 27
http://www.meetup.com/Seattle-Scalability-Meetup/events/225163163/

Colorado

Apache Spark 101: Introduction and What's New (Englewood) - Tuesday, January 26
http://www.meetup.com/Denver-Cloudera-User-Group/events/227944582/

Texas

Hadoop, HBase, and Spark by John Leach (Houston) - Thursday, January 28
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/227872939/

Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, January 25
http://www.meetup.com/Cleveland-Hadoop/events/226257989/

Florida

SPARKling Analytics by Ravi Nair (Jacksonville) - Tuesday, January 26
http://www.meetup.com/jaxbigdata/events/228059860/

Apache NiFi: Joe Witt of Hortonworks (Orlando) - Tuesday, January 26
http://www.meetup.com/orlandodata/events/227963685/

Georgia

Keeping Cool Under Pressure with Apache NiFi (Atlanta) - Thursday, January 28
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/227489527/

Virginia

Interactive Visualization + Leveraging Spark in a Hybrid OLTP/OLAP (Reston) - Tuesday, January 26
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/227559860/

Pennsylvania

DataPhilly January 2016 (Philadelphia) - Wednesday, January 27
http://www.meetup.com/DataPhilly/events/227686970/

New York

Real-Time Big Data (New York) - Wednesday, January 27
http://www.meetup.com/TechTalks-AppNexus-NYC/events/227915908/

CANADA

Toronto Apache Spark #5 (Toronto) - Wednesday, January 27
http://www.meetup.com/Toronto-Apache-Spark/events/227413504/

MEXICO

The Data Pub January 2016 (Mexico City) - Monday, January 25
http://www.meetup.com/thedatapub/events/228110998/

UNITED KINGDOM

Building Your First Spark Streaming Application (Bath) - Thursday, January 28
http://www.meetup.com/Apache-Spark-South-West-UK/events/227453625/

NORWAY

Big Data, No Fluff: Let’s Get Started with Hadoop #5 (Oslo) - Thursday, January 28
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/223471777/

SPAIN

Spark and the Combination of Different Modules (Madrid) - Wednesday, January 27
http://www.meetup.com/Madrid-Apache-Spark-Meetup/events/228087594/

FRANCE

Establishment of a Hadoop Big Data/Mesos Infrastructure (Paris) - Wednesday, January 27
http://www.meetup.com/Paris-Big-Data-Classes/events/227882770/

BELGIUM

Data Processing Using Amazon Web Services: A Panel Discussion (Antwerpen) - Tuesday, January 26
http://www.meetup.com/Brussels-Data-Science-Community-Meetup/events/226987575/

Kafka and HortonWorks Use Cases (Brussels) - Tuesday, January 26
http://www.meetup.com/bigdatabe/events/228157221/

GERMANY

Apache Flink Meetup Berlin #13: Roadmap 2016/Implementing BigPetStore (Berlin) - Tuesday, January 26
http://www.meetup.com/Apache-Flink-Meetup/events/228003362/

Python & Spark by Thorsten Greiner (Dusseldorf) - Wednesday, January 27
http://www.meetup.com/Dusseldorf-Data-Science-Meetup/events/227496388/

Big Data, Berlin (Berlin) - Thursday, January 28
http://www.meetup.com/Big-Data-Berlin/events/227414653/

ISRAEL

Getting the Most Out of HBase! Transactions and Advanced Caching (Tel Aviv-Yafo) - Wednesday, January 27
http://www.meetup.com/HBase-Israel-Meetup/events/227824252/

From Legacy DWH to State-of-the-Art Hadoop & Vertica Data Platform by AOL (Tel Aviv-Yafo) - Sunday, January 31
http://www.meetup.com/Big-Data-Israel/events/227714836/

INDIA

Exploring the Goodness of MapReduce, Hive & Spark (Gurgaon) - Thursday, January 28
http://www.meetup.com/ThoughtWorks-GGN-Geek-Night/events/227997513/

Interactive Analytics Using Apache Spark (Bangalore) - Saturday, January 30
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/228030045/

Spark Streaming and MLlib (Hyderabad) - Saturday, January 30
http://www.meetup.com/HySpark/events/228251691/