Data Eng Weekly


Hadoop Weekly Issue #193

13 November 2016

Welcome to a double issue of Hadoop Weekly. There's lots of breadth in this week's issue, from Apache Avro to Apache Spark and everything in between.

Technical

The Cloudera blog has a post describing how the new Apache Oozie database migration tool works to maintain job configuration and history during Oozie upgrades.

http://blog.cloudera.com/blog/2016/11/how-to-use-the-new-apache-oozie-database-migration-tool/

MapR's latest whiteboard walkthrough covers Apache Drill's query optimizer. Built on Apache Calcite, the optimizer implements both rule-based optimizations (e.g. projection push-down, partition pruning) and cost-based optimizations (e.g. join reordering).

https://www.mapr.com/blog/apache-drill-sql-query-optimization-whiteboard-walkthrough

The morning paper has a recap of a 2015 paper from Databricks about some of the changes they implemented in Spark based on their customers' experience. While there are some things that have been covered elsewhere (e.g. the optimizations), there's also discussion of some internals, like their switch to Netty and assumptions about HDFS block sizes, that I hadn't come across before.

https://blog.acolyer.org/2016/11/04/scaling-spark-in-the-real-world-performance-and-usability/amp/

For the distributed systems folks, this is an interesting presentation on Flexible Paxos—i.e. the ability to reach consensus without majorities.

http://www.slideshare.net/heidiannhoward/flexible-paxos-reaching-agreement-without-majorities
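To make the "without majorities" idea concrete, here is a minimal sketch. It assumes simple counting quorums, where the requirement is only that every phase-1 (leader election) quorum overlaps every phase-2 (replication) quorum, i.e. q1 + q2 > n. The function names are made up for illustration and don't come from the presentation.

```python
from itertools import combinations


def quorums_intersect(n, q1, q2):
    """Brute-force check: does every phase-1 quorum of size q1 share at
    least one node with every phase-2 quorum of size q2, in a cluster of n?"""
    nodes = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, q1)
        for b in combinations(nodes, q2)
    )


def counting_rule(n, q1, q2):
    """Closed-form version of the same check for size-based quorums."""
    return q1 + q2 > n


# With 6 nodes, a replication quorum of 3 is not a majority, but it is
# still safe when paired with a leader-election quorum of 4.
print(quorums_intersect(6, 4, 3))
# Two quorums of 3 can miss each other entirely, so this pairing is unsafe.
print(quorums_intersect(6, 3, 3))
```

The trade-off is that a smaller replication quorum has to be paired with a larger leader-election quorum, which may be acceptable if leader changes are rare.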

This month's Log Compaction post, which covers news in the Apache Kafka community, describes several Kafka improvements that are underway (including improvements for multi-tenancy) and links to posts on Kafka at Walmart, unit testing Kafka, and a great explanation of encryption for Kafka messages.

https://www.confluent.io/blog/bloglog-compaction-highlights-in-the-apache-kafka-and-stream-processing-community-november-2016/

The IBM Hadoop Dev blog has a post highlighting several presentations from the recent World of Watson conference. The speakers covered various themes in healthcare, fraud detection, and marketing.

https://developer.ibm.com/hadoop/2016/11/06/biginsights-premises-cloud-customer-use-cases-world-watson-conference-2016/

This presentation gives an introduction to Hivemall, which is a new Apache incubator project for machine learning on Apache Spark, Apache Hive, and Apache Pig. It's been around outside of the ASF for quite some time, though, and it has a fairly impressive feature set. The presentation describes use cases and example syntax for training and prediction.

http://www.slideshare.net/myui/dots20161029-myui

Apache Avro is a well-supported file format throughout the Hadoop ecosystem due to its compact encoding and support for schema evolution. This post describes how it can be used with Hive, including how to add or remove columns from the Hive definition in a backwards/forwards-compatible way.

http://getindata.com/blog/post/schema-evolution-with-avro-and-hive/
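As a rough illustration of why added and removed columns can stay compatible, here is a simplified sketch of how a reader schema is applied to a record written with an older or newer schema. It is plain Python rather than the Avro library: `resolve_record` is a hypothetical helper, and it ignores type promotion, aliases, and nested types.

```python
def resolve_record(record, reader_fields):
    """Project a decoded record onto the reader's fields.

    reader_fields is a list of dicts shaped like Avro field definitions,
    each with a "name" and an optional "default".
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field is known to both writer and reader: keep the value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added after the data was written: use the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present only in the written record are dropped.
    return resolved


old_record = {"id": 1, "name": "alice"}
new_reader = [
    {"name": "id"},
    {"name": "email", "default": None},  # added column, needs a default
]
print(resolve_record(old_record, new_reader))
```

The practical takeaway is that a newly added column needs a default value so that older files remain readable, while a removed column is simply ignored when reading.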

Databricks has announced a new documentation resource for their own product as well as for Apache Spark. The Spark materials include tutorials, a SQL language manual, training materials, and more.

https://databricks.com/blog/2016/11/10/databricks-launches-comprehensive-guide-product-apache-spark.html

The Hortonworks blog has a post that describes Apache MiNiFi and outlines several use cases. MiNiFi aims to run where data is collected and can be either a C++ or Java agent.

http://hortonworks.com/blog/edge-intelligence-iot-apache-minifi/

Big Data Labs has a number of interesting Spark tutorials and use cases. This week, there's a new walkthrough on analyzing Capital Bikeshare historical trip data using a number of Spark's machine learning libraries.

http://clouddatalab.com/index.html

News

This post has a look at the trade-offs and SLAs for Google's various storage and blob storage tiers (such as regional, nearline, and coldline). The author pulls together public details about Google's infrastructure and adds a bit of speculation to talk about how the various tiers are likely implemented.

https://www.nextplatform.com/2016/10/28/learning-googles-cloud-storage-evolution/

SearchDataManagement has an article about how several companies are using Apache Spark for use cases ranging from web site personalization to bank analytics. Even in its adolescent state, Spark is gaining pretty wide adoption.

http://searchdatamanagement.techtarget.com/feature/Functionality-gaps-not-stopping-Spark-usage-from-growing-fast

DataStax made some news in the open-source community last week by saying that many of their developers will be focussing on DataStax Enterprise rather than Apache Cassandra.

http://www.datastax.com/2016/11/serving-customers-serving-the-community

The Confluent blog has a post describing the history of non-JVM clients for Apache Kafka, the work that was done for simplifying the client protocol (so that clients don't depend on ZooKeeper), and Confluent's progress towards using the C-based client to power other non-JVM languages (like Python and Go).

https://www.confluent.io/blog/confluent-contributions-to-the-apache-kafka-client-ecosystem

Hortonworks reported earnings for Q3. They lost $64.7 million on $47.5 million in revenue.

https://finance.yahoo.com/news/hortonworks-reports-3q-loss-215553860.html

Qubole and Oracle have announced that the Qubole Data Service is now generally available on the Oracle Bare Metal Cloud Service.

https://www.qubole.com/blog/product/qds-on-oracle-bare-metal-cloud-service-generally-available/

Flink Forward is taking place in San Francisco in April. The call for papers opens soon.

http://sf.flink-forward.org/

Releases

Amazon EMR 5.1.0 was recently released, and it's the first version in which Apache Flink is natively supported.

https://aws.amazon.com/blogs/big-data/use-apache-flink-on-amazon-emr/

Altiscale has announced that they're supporting ACID transactions for Apache Hive on their Hadoop-as-a-Service platform.

https://www.altiscale.com/blog/hive-transactions-feature-now-on-altiscale/

Apache Fluo (incubating) is a system based on Google's Percolator for performing incremental updates on data stored in Apache Accumulo. Version 1.0.0-incubating was recently released.

https://fluo.apache.org/release/fluo-recipes-1.0.0-incubating/

Version 0.1.0-incubating of Apache S2Graph was released this week. S2Graph is a distributed graph processing system with a REST API, bulk loader, and more. It uses Apache HBase for storage.

https://lists.apache.org/thread.html/5026e4616f844d58295ff08a5b7c819afba6f2dfe2b02b22c455814e@%3Cdev.s2graph.apache.org%3E

Cloudera Labs has announced support for version 0.10.0 of YCSB, the benchmarking tool for NoSQL databases. There are a number of changes including support for Apache Solr, Google Cloud Datastore and Bigtable, and more.

http://blog.cloudera.com/blog/2016/11/ycsb-0-10-0-now-in-cloudera-labs/

Version 0.10.0 of Apache Knox, a REST API gateway for the Hadoop ecosystem, was released this week. The release includes improvements to LDAP, PAM support, and WebSocket support.

https://cwiki.apache.org/confluence/display/KNOX/News#News-2016-11-07ApacheKnoxGateway0.10.0Released!

Apache Spark 1.6.3, the latest maintenance release in the 1.x family, was announced this week. It contains over 35 bug fixes and a number of improvements.

http://spark.apache.org/news/spark-1-6-3-released.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Big Data Science Meetup (Mountain View) - Monday, November 14
https://www.meetup.com/Big-Data-Science/events/227573224/

Stream Computing: The Engineer’s Perspective (San Francisco) - Tuesday, November 15
https://www.meetup.com/SF-Big-Analytics/events/234755633/

#OCBigData Meetup #20 (Irvine) - Wednesday, November 16
https://www.meetup.com/OCBigData/events/235153309/

Architecture of an Open Source RDBMS Powered by HBase and Spark (Mountain View) - Wednesday, November 16
https://www.meetup.com/sv-jug/events/233592105/

Airflow Meetup (Redwood City) - Wednesday, November 16
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/234778571/

Pulsar: Distributed Pub-Sub Messaging & Apache NiFi in Action (Mountain View) - Thursday, November 17
https://www.meetup.com/openvswitch/events/234920532/

Washington

Building Recommendation Systems in Python Using Apache Spark (Seattle) - Tuesday, November 15
https://www.meetup.com/Seattle-Data-Engineering/events/235309624/

Security and Machine Learning with Apache Spark (Seattle) - Wednesday, November 16
https://www.meetup.com/Seattle-Spark-Meetup/events/230290257/

Texas

H2O Sparkling Water on Azure Using HDInsight Spark (Dallas) - Wednesday, November 16
https://www.meetup.com/Microsoft-Dallas-Big-Data-Science/events/235026285/

HA Spark Streaming with DataStax Enterprise and Confluent (Houston) - Wednesday, November 16
https://www.meetup.com/Houston-Functional-Programmers/events/233603650/

Minnesota

Apache Kudu with Kudu Founder Todd Lipcon (Saint Paul) - Thursday, November 17
https://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/235429257/

Ohio

Harnessing Data Within Hadoop in the Connected World (Cincinnati) - Tuesday, November 15
https://www.meetup.com/South-Ohio-Hadoop-Users-Group-SOHUG/events/234925359/

Future of Data: Cincinnati (Cincinnati) - Thursday, November 17
https://www.meetup.com/futureofdata-cincinnati/events/235306103/

Georgia

How a Streams-First Architecture Enables Real-Time Big Data (Atlanta) - Wednesday, November 16
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/234803403/

North Carolina

November CHUG: Igniting Audience Measurement at Charter (Charlotte) - Wednesday, November 16
https://www.meetup.com/CharlotteHUG/events/227294076/

CANADA

Introduction to HDInsight (Vancouver) - Wednesday, November 16
https://www.meetup.com/NET-User-Group-of-BC/events/235341994/

SPAIN

Big Data in AWS (Madrid) - Wednesday, November 16
https://www.meetup.com/Innovative-technology-BEEVA/events/235320885/

Beyond Shuffling & Streaming Preview, by Holden Karau (Barcelona) - Thursday, November 17
https://www.meetup.com/Spark-Barcelona/events/235421242/

GERMANY

Typescript & Flow + Apache Spark + Jigsaw (Kiel) - Thursday, November 17
https://www.meetup.com/Nordic-Coding/events/234688614/

NETHERLANDS

PyData Amsterdam: The H20 Edition (Amsterdam) - Wednesday, November 16
https://www.meetup.com/PyData-NL/events/235348933/

HUNGARY

Big Data Meetup: November 2016 (Budapest) - Tuesday, November 15
https://www.meetup.com/Big-Data-Meetup-Budapest/events/235160346/

CROATIA

Operational Analytics Using Spark and Storm (Zagreb) - Tuesday, November 15
https://www.meetup.com/Apache-Spark-Zagreb-Meetup/events/234963668/

ISRAEL

The Best of Hadoop Summit 2016 + Screening of Doctor Strange (Tel Aviv-Yafo) - Tuesday, November 15
https://www.meetup.com/cloudzone-academy/events/235126709/

AUSTRALIA

November Meetup (Brisbane) - Tuesday, November 15
https://www.meetup.com/Brisbane-Net-User-Group/events/235459464/