Data Eng Weekly


Hadoop Weekly Issue #173

05 June 2016

It was a relatively quiet week with coverage of Spark, NiFi, Netflix's Meson, Storm, and more. Spark Summit is this week in San Francisco, so I'm sure there will be lots of great content for next week's issue (please send presentations my way!).

Technical

The Databricks blog has an overview of a new feature of the upcoming Apache Spark 2.0—cross-language support for storing and loading machine learning models. Models are saved and loaded with a simple API that stores metadata and parameters as JSON and data as Parquet.

https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html

Meson is Netflix's framework for executing machine learning workflows. It's the glue between a lot of big data technology like Apache Hive, Spark, and Mesos. Workflows are authored using a DSL, and Meson has a UI providing advanced visualizations of pipelines. Netflix hasn't open-sourced Meson yet, but they plan to in the future.

http://techblog.netflix.com/2016/05/meson_31.html

The IBM Hadoop Dev blog has a brief introduction and tutorial for using the HDFS archival storage capabilities.

https://developer.ibm.com/hadoop/2016/06/01/use-hdfs-archival-storage/
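
For readers who haven't tried archival storage, here is a minimal sketch of the two commands involved, assembled in Python as argument lists. It assumes the `hdfs` CLI from Hadoop 2.6 or later and DataNodes that already have ARCHIVE storage configured. The helper name `archival_commands` and the example directory are hypothetical and do not come from the IBM post.

```python
import shlex

# Storage policy names documented for HDFS archival storage (Hadoop 2.6+).
KNOWN_POLICIES = {"HOT", "WARM", "COLD", "ALL_SSD", "ONE_SSD", "LAZY_PERSIST"}


def archival_commands(path, policy="COLD"):
    """Return two hdfs commands (as argv lists): one that tags `path` with a
    storage policy, and one that migrates existing blocks to match it."""
    policy = policy.upper()
    if policy not in KNOWN_POLICIES:
        raise ValueError(f"unknown storage policy: {policy}")
    set_policy = ["hdfs", "storagepolicies", "-setStoragePolicy",
                  "-path", path, "-policy", policy]
    # Setting a policy does not relocate blocks that were already written;
    # the mover tool scans the path and migrates them.
    run_mover = ["hdfs", "mover", "-p", path]
    return [set_policy, run_mover]


if __name__ == "__main__":
    # /data/archive/2015 is a hypothetical directory used for illustration.
    for argv in archival_commands("/data/archive/2015"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Returning argument lists rather than running the commands keeps the sketch safe to try anywhere; on a real cluster each list could be handed to `subprocess.run`.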

Apache Storm 1.0 has some impressive new features. This post looks at several of the improvements to debugging capabilities: dynamic log levels, a unified log search, event sampling, and integrated worker profiling via jstack, heap dumps, and Java Flight Recorder.

http://hortonworks.com/blog/whats-new-apache-storm-1-0-part-1-enhanced-debugging/
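
As a rough illustration of the dynamic log level feature, the sketch below builds a `storm set_log_level` invocation in Python. It assumes the Storm 1.0 command-line syntax of `-l logger=LEVEL[:timeout]` followed by the topology name. The helper `set_log_level_command` and the topology name `wordcount` are hypothetical.

```python
# Level names accepted by Log4j 2, which Storm 1.0 uses for worker logging.
LOG_LEVELS = {"ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"}


def set_log_level_command(topology, logger, level, timeout_secs=None):
    """Return the argv list for changing a logger's level on a running topology.

    When timeout_secs is given, it is appended as the optional timeout part
    of the logger setting.
    """
    level = level.upper()
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    setting = f"{logger}={level}"
    if timeout_secs is not None:
        setting += f":{int(timeout_secs)}"
    return ["storm", "set_log_level", "-l", setting, topology]


if __name__ == "__main__":
    # Turn on DEBUG for the root logger of a hypothetical topology for 30 seconds.
    print(" ".join(set_log_level_command("wordcount", "ROOT", "debug", 30)))
```

The same change can be made from the Storm UI, which is what the post itself walks through.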

The Cloudera blog has a post describing how to use Apache Spark to do exploratory analysis of historical basketball statistics data stored in CSV files. Analysis is done using a mix of Scala and SQL.

http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-using-apache-spark-and-sql/

Apache NiFi is gaining a lot of attention as a versatile tool. It's built for "flow-based processing," a term that may not mean much to many people, but in practice NiFi supports standard ETL, stream processing, and more. Many NiFi demos show moving data from the Twitter firehose to HDFS, but this one focuses on a different problem to demonstrate NiFi's versatility: pulling data via HTTP, performing some simple processing, and more.

http://hortonworks.com/blog/apache-nifi-not-scratch/

Amazon Redshift is built on the PostgreSQL engine, so you can leverage some PostgreSQL extensions to link a PostgreSQL instance with a Redshift cluster. This enables some interesting applications like joining across databases, pulling Redshift results as JSON, creating materialized views of Redshift data in Postgres, and easily copying data between databases.

http://blogs.aws.amazon.com/bigdata/post/Tx1GQ6WLEWVJ1OX/JOIN-Amazon-Redshift-AND-Amazon-RDS-PostgreSQL-WITH-dblink
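
To make the dblink idea concrete, here is a minimal sketch that builds the kind of query you would run on the PostgreSQL side. It assumes the `dblink` extension is installed and that a foreign server pointing at the Redshift cluster has already been defined. The server name `redshift_server`, the `sales` table, and the helper `dblink_select` are hypothetical.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def dblink_select(server, remote_sql, columns, tag="REDSHIFT"):
    """Build a PostgreSQL query that runs `remote_sql` on the linked cluster.

    `columns` is a list of (name, sql_type) pairs. dblink returns generic
    records, so the caller has to declare the result columns explicitly.
    Column types are trusted input in this sketch.
    """
    if not _IDENT.match(server):
        raise ValueError(f"invalid server name: {server!r}")
    for name, _ in columns:
        if not _IDENT.match(name):
            raise ValueError(f"invalid column name: {name!r}")
    quote = f"${tag}$"
    if quote in remote_sql:
        raise ValueError("remote SQL must not contain the dollar-quote tag")
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    # Dollar quoting lets the remote SQL contain single quotes unescaped.
    return (
        f"SELECT * FROM dblink('{server}', {quote} {remote_sql} {quote}) "
        f"AS remote ({column_defs});"
    )


if __name__ == "__main__":
    print(dblink_select(
        "redshift_server",
        "SELECT region, count(*) FROM sales GROUP BY region",
        [("region", "text"), ("orders", "bigint")],
    ))
```

Because the result behaves like an ordinary row set in PostgreSQL, it can be joined against local tables or wrapped in a materialized view, which is how the applications mentioned above are built.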

News

FeatherCast has posted audio of over 100 sessions from ApacheCon North America.

http://feathercast.apache.org/tag/apacheconna2016/

InfoWorld has an overview of Heron, Twitter's recently open-sourced, Apache Storm-compatible project. The post describes some of the architectural differences between the two projects. It also points out that Heron was started several months (and several Storm releases) ago, which means Storm has been catching up on some of Heron's most advantageous features.

http://www.infoworld.com/article/3078134/analytics/had-it-with-apache-storm-heron-swoops-to-the-rescue.html

Databricks is running a new course, "Introduction to Apache Spark," at edX. The course starts on June 15th and runs for two weeks.

https://databricks.com/blog/2016/06/01/databricks-to-launch-first-of-five-free-big-data-courses-on-apache-spark.html

Releases

Amazon EMR version 4.7.0 was announced. This release adds support for Apache Tez and Apache Phoenix, and it includes new versions of Apache HBase, Apache Mahout, and Presto. The AWS Big Data blog has a tutorial for getting started with Phoenix.

http://aws.amazon.com/blogs/aws/amazon-emr-4-7-0-apache-tez-phoenix-updates-to-existing-apps/
http://blogs.aws.amazon.com/bigdata/post/Tx2ZF1NDQYDJFGT/Supercharge-SQL-on-Your-Data-in-Apache-HBase-with-Apache-Phoenix

Apache Hive 2.0.1 was released this week. It's the first dot release since 2.0.0 was released in February. The new version includes over 60 bug fixes.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CD37344A3.77A64%25sershe@apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Spark Maker Community Event (San Francisco) - Monday, June 6
http://www.meetup.com/streams/events/230766082/

Spark in Production and Spark HBase (San Francisco) - Monday, June 6
http://www.meetup.com/futureofdata-sanfrancisco/events/230698159/

Intel: Combining Spark and Open Source Elements (San Francisco) - Monday, June 6
http://www.meetup.com/San-Francisco-ODSC/events/231162698/

Tensorflow on Apache Spark & Ask Me Anything (San Francisco) - Monday, June 6
http://www.meetup.com/spark-users/events/230322011/

Streaming Data Pipelines with Containers (San Francisco) - Tuesday, June 7
http://www.meetup.com/SF-Big-Analytics/events/230753023/

Building Pipeline with Kafka Connect (Fremont) - Wednesday, June 8
http://www.meetup.com/datariders/events/230137371/

Illinois

Mike Keane: Integrating Flume and Kafka to Process > 100B Entries/Day (Chicago) - Thursday, June 9
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233/

Wisconsin

Interactive Data Analysis with Apache Spark and Apache Zeppelin (Milwaukee) - Tuesday, June 7
http://www.meetup.com/MKE-Big-Data/events/230722728/

Pennsylvania

Solr, Spark and Zeppelin: The Analytics Toolkit for Distributed Big Data (Philadelphia) - Tuesday, June 7
http://www.meetup.com/futureofdata-philadelphia/events/231303725/

ISRAEL

Stream Processing with Apache Beam and Spark (Tel Aviv-Yafo) - Tuesday, June 7
http://www.meetup.com/Big-things-are-happening-here/events/231274864/

NEW ZEALAND

Spark Meetup Auckland June 2016 (Auckland) - Tuesday, June 7
http://www.meetup.com/Auckland-Apache-Spark-User-Group/events/231227124/