Data Eng Weekly


Hadoop Weekly Issue #218

29 May 2017

Short and sweet issue this week, covering Spark's structured streaming, HDFS's new Maintenance State, data exploration tools at Stitch Fix, new products from Cloudera, MapR, and Databricks, and more.

Technical

Spark's structured streaming has a "ProcessingTime" trigger that attempts to process new data at regular intervals (like cron). On a cluster that is elastic in size, this can save money by bringing up resources only when the trigger fires. That said, jobs can still be stateful, and structured streaming has a few other features (such as bookkeeping of failures and table-level atomicity) that make it more attractive than a normal batch operation.

https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
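
For a rough idea of what a fixed-interval trigger looks like in code, here is a minimal PySpark sketch. It is not taken from the post: the function name, paths, and schema are all made up for illustration, it assumes a Spark version whose readStream.schema() accepts a DDL string, and the caller has to supply a live SparkSession.

```python
def start_periodic_query(spark, input_path, output_path, checkpoint_path,
                         interval="1 hour"):
    """Start a streaming query that looks for new input once per interval.

    `spark` is an existing SparkSession; every path and column name here is
    a hypothetical placeholder.
    """
    events = (
        spark.readStream
        # File sources need an explicit schema; this one is invented.
        .schema("user_id STRING, amount DOUBLE, ts TIMESTAMP")
        .json(input_path)
    )
    return (
        events.writeStream
        .format("parquet")
        .option("path", output_path)
        # The checkpoint records which input has already been processed,
        # so a restarted query resumes where the previous run stopped.
        .option("checkpointLocation", checkpoint_path)
        # Fire a micro-batch on a fixed processing-time interval.
        .trigger(processingTime=interval)
        .start()
    )
```

Calling start_periodic_query(spark, ...) returns the running query handle, and awaitTermination() on that handle blocks until the query stops. The checkpoint location is what gives the job its bookkeeping across restarts.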

The Cloudera blog has an overview of a new feature in HDFS called the "Maintenance State." Essentially, it provides a mechanism for temporarily removing nodes from the cluster without causing a replication storm (useful, for example, when patching an entire rack at a time). This feature requires a new "maintenance" file (the dfs.hosts file format isn't rich enough) that is JSON-like. The post has more details on the implementation and how to use it (in CDH 5.11+, at least).

http://blog.cloudera.com/blog/2017/05/hdfs-maintenance-state/
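
To give a feel for what such a file might contain, here is a small Python sketch that generates maintenance entries for a rack. The field names (hostName, adminState, maintenanceExpireTimeInMS) are what I recall from the upstream Apache Hadoop documentation and should be treated as assumptions. The post calls the format JSON-like, so the exact layout your CDH version expects may differ from the JSON array emitted here; check the post before using it. The host names and helper functions are hypothetical.

```python
import json
import time

MS_PER_HOUR = 60 * 60 * 1000


def maintenance_entry(host, hours, now_ms=None):
    """Describe one DataNode that should stay in maintenance for `hours`.

    Field names are assumed from the upstream Hadoop docs, not verified
    against CDH 5.11.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "hostName": host,
        "adminState": "IN_MAINTENANCE",
        # Absolute expiry time in epoch milliseconds.
        "maintenanceExpireTimeInMS": now_ms + hours * MS_PER_HOUR,
    }


def render_hosts_file(entries):
    """Serialize the entries as a JSON array, one object per DataNode."""
    return json.dumps(entries, indent=2)


# Hypothetical rack of three DataNodes, patched within a two-hour window.
rack = ["dn%d.example.com" % i for i in range(1, 4)]
print(render_hosts_file([maintenance_entry(h, hours=2) for h in rack]))
```

The expiry timestamp is the interesting part: it bounds how long a node can stay out of service, which is what lets the cluster skip re-replication in the meantime.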

The Algorithms & Analytics team at Stitch Fix has written about their data exploration tool, Dora. The data system is backed by an Elasticsearch cluster, whose data is generated by Spark from data in S3.

http://multithreaded.stitchfix.com/blog/2017/05/23/building-a-data-exploration-tool-with-react/

Hadoop, Spark, and the broader ecosystem offer the ability to process complex data with nested structs, arrays, maps, and more. Support for this complex data is great in a programmatic setting, but it's trickier to use from SQL. This post looks at the TRANSFORM operation and other "Higher Order Functions" that have been added to Spark SQL. This feature is available in the Databricks 3.0 beta, and there's a JIRA ticket open (SPARK-19480) to add it to Spark core.

https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html
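
To illustrate the idea without a Spark cluster, here is a plain-Python sketch of what a TRANSFORM-style higher-order function does to an array column. It only emulates the semantics and is not Spark's implementation. The table, the column names, and the SQL in the comment are illustrative assumptions; the lambda syntax is written as I understand it from the post.

```python
def transform(array, fn):
    """Apply fn to every element of a nested array, preserving order."""
    return [fn(x) for x in array]


# Hypothetical table with one array-typed column per row.
nested_data = [
    {"key": 1, "values": [1, 2, 3]},
    {"key": 2, "values": [10, 20]},
]

# Roughly what this SQL would compute (syntax assumed, not verified):
#   SELECT key, TRANSFORM(values, value -> value + 1) AS values_plus_one
#   FROM nested_data
result = [
    {"key": row["key"],
     "values_plus_one": transform(row["values"], lambda v: v + 1)}
    for row in nested_data
]
print(result)
```

The appeal is that the array never has to be exploded into rows and re-aggregated; the function is applied in place, once per row.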

This post provides an overview and comparison of Kafka Connect and StreamSets Data Collector. Both tools are capable of moving data between systems, which is the main focus of the comparison.

https://www.linkedin.com/pulse/kafka-connect-vs-streamsets-advantages-disadvantages-slim-baltagi

In another comparison with Kafka, this post provides a high-level overview of the similarities and differences between Kafka and Amazon Kinesis. It primarily looks at the system level (primitives like topics, streams, partitions, and shards) and at getting data into the system.

http://dataconomy.com/2017/05/kinesis-kafka-big-data-analysis/

News

Cloudera has announced their first hosted service, Cloudera Altus. It's a "Data Engineering service" that takes care of provisioning clusters and running jobs in an existing AWS account. The post has more details. At first glance, it resembles many other Hadoop-as-a-service offerings, so it'll be interesting to see where Cloudera tries to differentiate.

http://blog.cloudera.com/blog/2017/05/data-engineering-with-cloudera-altus/

Databricks has announced the Databricks Runtime 3.0 beta. Based on Apache Spark 2.2.0 release candidates, it also includes improvements to S3 throughput, better performance, and support for transactional writes to S3.

https://databricks.com/blog/2017/05/24/databricks-runtime-3-0-beta-delivers-enterprise-grade-apache-spark.html

The Apache Knox team disclosed CVE-2017-5646: "Apache Knox Impersonation Issue for WebHDFS." Users are encouraged to upgrade to Apache Knox 0.12.0.

https://lists.apache.org/thread.html/f4930844052bf4fb7a0435c1779b50f54211bd8d447b31dc2b10f112@%3Cannounce.apache.org%3E

ZDNet has coverage of MapR's new deep learning product, Quick Start Solution (QSS).

http://www.zdnet.com/article/artificial-intelligence-on-hadoop/

Releases

Apache NiFi 0.7.3 was released with reliability, performance, and other fixes.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version0.7.3

Version 0.4.0 of Apache Arrow, the in-memory columnar data layer for a number of Hadoop ecosystem projects, was released. Highlights include a beefed-up JavaScript implementation, Windows Python support, and more.

https://arrow.apache.org/blog/2017/05/23/0.4.0-release/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Texas

Talend Presents Sensors, Spark and Kafka: Applied Machine Learning (Addison) - Tuesday, May 30
https://www.meetup.com/DFW-BigData/events/239361814/

Florida

Tracking Trains in Real Time Using Stream Processing in Apache Kafka and Storm (Jacksonville) - Tuesday, May 30
https://www.meetup.com/jaxbigdata/events/240063054/

Pennsylvania

Large-Scale Text Processing Pipeline With Spark ML and GraphFrames (Philadelphia) - Thursday, June 1
https://www.meetup.com/Philadelphia-Hadoop-User-Group/events/239666223/

CANADA

Toronto Apache Spark #20 (Toronto) - Wednesday, May 31
https://www.meetup.com/Toronto-Apache-Spark/events/239840844/

UNITED KINGDOM

Using Apache NiFi to Empower Self-Organizing Teams (London) - Wednesday, May 31
https://www.meetup.com/futureofdata-london/events/240052173/

SPAIN

Discover Khermes, an Open-Source & Distributed Data Generator for Apache Kafka (Madrid) - Thursday, June 1
https://www.meetup.com/apachekafkamadrid/events/240052933/

BELGIUM

Big Data Analytics (Kontich) - Wednesday, May 31
https://www.meetup.com/Belgium-Cloudera-User-Group/events/239354322/

NETHERLANDS

7th Recommender Systems Amsterdam Meetup (Amsterdam) - Tuesday, May 30
https://www.meetup.com/Recommender-Systems-Amsterdam/events/238564357/

GERMANY

Our First Kafka Meetup with 2 Amazing Speakers From Confluent (Walldorf) - Tuesday, May 30
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/240032383/

ITALY

Apache Spark: A Unique Engine for Big Data Processing (Milan) - Thursday, June 1
https://www.meetup.com/Big-Data-Cloudera-Ecosystem-Milano/events/240052644/

AUSTRIA

DataCamp Vienna: Spring Edition (Vienna) - Tuesday, May 30
https://www.meetup.com/Austrian-Cloud-and-Big-Data-Forum/events/238530740/

HUNGARY

Building Streaming Data Pipelines (Budapest) - Wednesday, May 31
https://www.meetup.com/futureofdata-budapest/events/239821158/

INDIA

Workshop on Spark 2.x (Pune) - Saturday, June 3
https://www.meetup.com/Pune-Big-Data-Conference-Group/events/240243914/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com