Data Eng Weekly


Hadoop Weekly Issue #184

28 August 2016

This week's issue has several tutorials, an update on Flume/Kafka integration, and a great overview of flavors of distributed databases. In news, Altiscale is being acquired, and there's a post exploring the role of analytics RDBMSes. Finally, there were several releases this week including Hadoop and Kudu.

Technical

Apache Gearpump (incubating) is a streaming engine built on Akka. This presentation describes Gearpump's architecture by looking at several "hard parts" (e.g. exactly once, out-of-order) of stream processing that it solves.

http://www.slideshare.net/manuzhang/apache-gearpump-nextgen-streaming-engine

This tutorial describes how to consume tweets using Twitter's streaming API and process them with Kafka Streams. The code to accompany the post is available on GitHub.

https://www.madewithtea.com/processing-tweets-with-kafka-streams.html

Apache Flume 1.7 will switch to the new Apache Kafka client APIs. This post describes some of the resulting changes, for example support for the Flume Avro Event, different configuration settings for producers and consumers, and a one-time manual migration of offsets (from ZooKeeper to Kafka).

http://blog.cloudera.com/blog/2016/08/new-in-cloudera-enterprise-5-8-flafka-improvements-for-real-time-data-ingest/

This post makes a compelling argument that the hardest and most impactful feature of distributed systems is multi-tenancy and not scalability. It also describes recent work and plans for multi-tenancy in Kafka.

http://www.confluent.io/blog/sharing-is-caring-multi-tenancy-in-distributed-data-systems

The Syncsort blog has an interview with Hortonworks' Owen O'Malley. The first part covers the history of Hadoop and the ORC file format, and the second part covers Spark, Tez, and Hive's Live Long and Process (LLAP).

http://blog.syncsort.com/2016/08/big-data/expert-interview-series-paige-robertsowen-omalley-part-1/
http://blog.syncsort.com/2016/08/big-data/expert-interview-series-part-2-hortonworks-co-founder-technical-fellow-owen-omalley-origins-hadoop/

When storing data in an external table, partitions must be manually added to the Hive metastore to appear in results. This post describes how to use AWS Lambda to trigger the addition of the new partitions when data arrives in S3.

http://blogs.aws.amazon.com/bigdata/post/Tx2LYJMAED4TVXY/Data-Lake-Ingestion-Automatically-Partition-Hive-External-Tables-with-AWS
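
As a rough illustration of the idea (not the code from the post), a Lambda handler could derive the partition from the key of the newly arrived S3 object and build the matching Hive DDL statement. The sketch below assumes Hive-style key layouts such as logs/dt=2016-08-28/hr=05/part-0000.gz; the function names, table name, and bucket name are hypothetical, and actually submitting the statement to Hive is left out.

```python
from urllib.parse import unquote_plus


def partition_spec_from_key(key):
    """Return (name, value) pairs for each Hive-style 'name=value' path segment."""
    pairs = []
    # Keys in S3 event notifications are URL-encoded, so decode first.
    # The last path segment is the file name, not a partition directory.
    for segment in unquote_plus(key).split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            pairs.append((name, value))
    return pairs


def add_partition_statement(table, bucket, key):
    """Build an ALTER TABLE ... ADD PARTITION statement, or None if the key has no partitions."""
    pairs = partition_spec_from_key(key)
    if not pairs:
        return None
    spec = ", ".join("{}='{}'".format(name, value) for name, value in pairs)
    # The partition location is the directory that contains the new object.
    prefix = unquote_plus(key).rsplit("/", 1)[0]
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) "
        "LOCATION 's3://{}/{}/'".format(table, spec, bucket, prefix)
    )
```

Using IF NOT EXISTS keeps the statement safe to re-run when several files land in the same partition.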

This tutorial describes how to use the XML Spark package to read data stored in XML and convert it into JSON.

https://medium.com/@anicolaspp/spark-packages-from-xml-to-json-404689e765ca#.dnxp09727
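
To make the conversion concrete, here is a minimal plain-Python sketch of the same XML-to-JSON idea. It uses only the standard library rather than Spark or the XML package from the tutorial, and it assumes flat records: each element with the chosen tag becomes one JSON object, with attributes and child elements as string fields. The function name and the row_tag parameter are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def xml_rows_to_json_lines(xml_text, row_tag):
    """Convert every <row_tag> element in xml_text into one JSON string."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root.iter(row_tag):
        # Start with the element's attributes, then add one field per child element.
        record = dict(row.attrib)
        for child in row:
            record[child.tag] = (child.text or "").strip()
        lines.append(json.dumps(record, sort_keys=True))
    return lines
```

Nested elements and repeated child tags are not handled here; a real pipeline would need a schema-aware reader for those cases.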

This post provides a thorough overview of the types of distributed databases, the various trade-offs (and the relation to the CAP theorem), and the features of several major projects. Using this information, the post presents a decision tree for picking a database given application-specific constraints.

https://medium.baqend.com/nosql-databases-a-survey-and-decision-guidance-ea7823a822d#.nqi9sfpp4

News

Videos from Big Data Day LA 2016 are available on YouTube. There are talks covering several Hadoop ecosystem projects, including Apache Beam, Apache Spark, and Apache Kudu.

https://www.youtube.com/results?search_query=%22Big+Data+Day+LA+2016%22

Altiscale, the big-data-as-a-service vendor, is being acquired by SAP for $125 million.

http://venturebeat.com/2016/08/25/sap-altiscale/

This post asks whether analytic RDBMSes are still useful and argues that they are, but only in a limited set of use cases (notably "hard-core business intelligence"). The post has some additional observations about this landscape, notably around Hadoop/Spark and open source.

http://www.dbms2.com/2016/08/28/are-analytic-rdbms-and-data-warehouse-appliances-obsolete/

Releases

Cask announced version 3.5 of the Cask Data Application Platform. This release adds fine-grained authorization, secured impersonation, a GUI-based Spark streaming pipeline builder, and more.

http://blog.cask.co/2016/08/cdap-3-5-enterprise-security-drag-and-drop-spark-streaming-and-much-more/

Apache Geode released version 1.0.0-incubating.M3. Geode is an in-memory, distributed storage system with SQL-like capabilities. It powers the commercial product Pivotal GemFire.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201608.mbox/%3CCAEwge-HFjeDta3X5_CNX7dp1nHWhxRgw6XnyDhp0mnNmt5Z0RA@mail.gmail.com%3E

Apache Kudu 0.10.0 was announced. This release improves stability of the high availability implementation, improves Spark integration, and more. Another post on the Kudu blog describes new range partitioning features of the release.

https://kudu.apache.org/2016/08/23/apache-kudu-0-10-0-released.html
https://kudu.apache.org/2016/08/23/new-range-partitioning-features.html

Version 1.7.0 of Apache Ignite has been released. Ignite is an in-memory data fabric, and it includes support for SQL queries, data grids, and more. This release includes a number of bug fixes and improvements, including the ability to join non-collocated data.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201608.mbox/%3C7517C0B2-4725-465D-B393-C1AB290736C3@apache.org%3E

Apache Hadoop 2.7.3 was released. It includes over 200 resolved issues, including support for writing metrics to Graphite, support for extended attributes, improvements to the NFS gateway, modernized web UIs, and more.

http://hadoop.apache.org/docs/r2.7.3/

WePay has recently written some interesting posts about how they use Kafka and BigQuery. This week, they open-sourced their connector between those systems (built with Kafka Connect), so it's even easier to replicate their setup. The introductory post has many details about the implementation.

https://wecode.wepay.com/posts/kafka-bigquery-connector

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Introduction to Apache SparkR by Databricks (San Francisco) - Wednesday, August 31
http://www.meetup.com/SF-Big-Analytics/events/231624001/

Heron at Twitter (Santa Clara) - Wednesday, August 31
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/232756887/

Implementing Industry Use Cases with Apache Spark: Live Demos! (Palo Alto) - Wednesday, August 31
http://www.meetup.com/BigDataCloud/events/233293574/

Ohio

August Edition of MOHUG (Dublin) - Tuesday, August 30
http://www.meetup.com/MOHUG-Mid-Ohio-Hadoop-User-Group/events/232891865/

Maryland

Building Big Data Solutions in Azure Data Platform (Laurel) - Tuesday, August 30
http://www.meetup.com/Data-Science-MD/events/232936519/

Virginia

Building Big Data Solutions in Azure Data Platform (Ashburn) - Wednesday, August 31
http://www.meetup.com/NOVA-Data-Science/events/233016776/

CANADA

Toronto Apache Spark #12 (Toronto) - Wednesday, August 31
http://www.meetup.com/Toronto-Apache-Spark/events/233319018/

UNITED KINGDOM

Scala Manchester (Manchester) - Wednesday, August 31
http://www.meetup.com/scala-developers/events/232954560/

GERMANY

From Excel via SQL to MapReduce and SparkSQL by Carsten Langer (Dusseldorf) - Wednesday, August 31
http://www.meetup.com/Dusseldorf-Data-Science-Meetup/events/232748138/

HUNGARY

Sustaining the Future of Data (Budapest) - Monday, August 29
http://www.meetup.com/futureofdata-budapest/events/233365270/

Machine Learning with H2O (Budapest) - Friday, September 2
http://www.meetup.com/budapest_data_science/events/233393554/

INDIA

Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, September 3
http://www.meetup.com/hyderabad-scalability/events/233253121/

Data Processing at Scale (Bangalore) - Saturday, September 3
http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/events/233588239/

SRI LANKA

Digging into Big Data with Google's BigQuery and Apache Spark (Colombo) - Wednesday, August 31
http://www.meetup.com/LKBigData/events/233172669/

AUSTRALIA

The Evolution of Apache Hive and an Introduction to Apache Zeppelin (Sydney) - Monday, August 29
http://www.meetup.com/Big-Data-Analytics/events/233343265/

Meet the Founders: Alan Gates and Apache Hive (Melbourne) - Tuesday, August 30
http://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/232689971/