Data Eng Weekly


Hadoop Weekly Issue #231

04 September 2017

There are quite a few releases this week, including several announced at Kafka Summit, which took place this past week in San Francisco. There are also great articles in this issue covering Kafka, Spark, Pulsar, Flink, and more.

Technical

This post describes how to visualize Kafka consumer offsets by exporting them over HTTP, collecting them with Prometheus, and graphing them in Grafana.

https://blog.godatadriven.com/monitoring-kafka-consumer-lag
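
As a rough illustration of the idea (not the code from the linked post), consumer lag per partition is the log end offset minus the group's committed offset, and Prometheus can scrape it from an HTTP endpoint that serves its plain-text exposition format. The sketch below assumes the offsets have already been fetched from Kafka; the metric name kafka_consumer_lag and the helper names are hypothetical, and treating a missing committed offset as 0 is an assumption.

```python
from typing import Dict, Tuple

# (topic, partition) -> offset
Offsets = Dict[Tuple[str, int], int]


def consumer_lag(end_offsets: Offsets, committed: Offsets) -> Offsets:
    """Lag per partition: log end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


def to_prometheus(group: str, lag: Offsets) -> str:
    """Render lag values in the Prometheus text exposition format."""
    # "kafka_consumer_lag" is a hypothetical metric name.
    lines = ["# TYPE kafka_consumer_lag gauge"]
    for (topic, partition), value in sorted(lag.items()):
        lines.append(
            f'kafka_consumer_lag{{group="{group}",topic="{topic}",'
            f'partition="{partition}"}} {value}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    end = {("orders", 0): 120, ("orders", 1): 80}
    done = {("orders", 0): 100}
    print(to_prometheus("billing", consumer_lag(end, done)), end="")
```

Serving that string from any HTTP handler gives Prometheus something to scrape, and a Grafana panel can then query the gauge by group, topic, and partition.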

Serving a model that was trained offline in a real-time setting can be a tricky venture. One popular solution is PMML, which has some limitations but generally works well for a certain set of use cases. The Red Ventures Data Science & Engineering team has written about their experience with MLeap, which is a new alternative to PMML with built-in support for Apache Spark.

https://medium.com/rv-data/mleap-providing-near-real-time-data-science-with-apache-spark-c34e7df093ca

Confluent has announced a new project, called KSQL, for running SQL queries across a Kafka cluster. KSQL differentiates between Streams (sequences of facts) and Tables (materialized streams) in its query capabilities, and it offers rich support for windowing. KSQL is in preview, and the code is open sourced under the Apache License.

https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
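
To make the Stream versus Table distinction concrete, here is a small conceptual sketch in plain Python. It is not KSQL and does not use any Kafka API: a stream is modeled as a list of (key, value) facts, a table as the latest value per key, and windowing as fixed-size (tumbling) buckets over event timestamps. The function names and sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def materialize(stream: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Fold a stream of (key, value) facts into a table of latest values."""
    table = {}
    for key, value in stream:
        table[key] = value  # later facts overwrite earlier ones for a key
    return table


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_ms: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in non-overlapping windows of window_ms."""
    counts = Counter()
    for timestamp_ms, key in events:
        window_start = timestamp_ms - timestamp_ms % window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    moves = [("alice", "NY"), ("bob", "SF"), ("alice", "LA")]
    print(materialize(moves))
    clicks = [(0, "a"), (500, "a"), (1000, "a"), (1500, "b")]
    print(tumbling_counts(clicks, window_ms=1000))
```

The stream keeps every fact, while the table keeps only the current state per key. A streaming SQL engine maintains that state continuously as new records arrive.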

In part two of their series on Apache Pulsar (incubating), the Streamlio blog describes several important components of Pulsar including its I/O isolation, scalability, security model, multi-language (C++, Java, and Python) API, and operational maturity. Since Pulsar was in use at Yahoo, it has several enterprise features.

https://streaml.io/blog/why-apache-pulsar-part-2/

PySpark is getting better support for MLlib algorithms. To enable this, there is a new PySpark implementation of the persistence framework that was previously Scala-only. This post has more details on the solution and how it fits into a machine learning pipeline.

https://databricks.com/blog/2017/08/30/developing-custom-machine-learning-algorithms-in-pyspark.html

The Databricks blog also has a great overview of Apache Spark 2.2's Cost-Based Optimizer. The post describes the statistics that feed into the optimizer (including how to collect them), examples of types of optimizations it's able to perform, and some experimental results comparing query time with and without the CBO on the TPC-DS benchmark.

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
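
For readers new to cost-based optimization, the toy sketch below shows how column statistics can turn into cardinality estimates and then into a plan choice. It uses textbook estimates (uniform distribution within min/max, equal frequency per distinct value) and is not Spark's implementation; the class, function names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Per-column statistics of the kind an optimizer might collect."""
    row_count: int
    distinct_count: int
    min_value: float
    max_value: float


def estimate_equals(stats: ColumnStats) -> float:
    """Rows matching col = constant, assuming values are equally frequent."""
    return stats.row_count / max(stats.distinct_count, 1)


def estimate_less_than(stats: ColumnStats, value: float) -> float:
    """Rows matching col < value, assuming a uniform spread over min..max."""
    if value <= stats.min_value:
        return 0.0
    if value >= stats.max_value:
        return float(stats.row_count)
    fraction = (value - stats.min_value) / (stats.max_value - stats.min_value)
    return stats.row_count * fraction


def pick_build_side(left_rows: float, right_rows: float) -> str:
    """Choose the smaller estimated input as the hash join build side."""
    return "left" if left_rows <= right_rows else "right"


if __name__ == "__main__":
    price = ColumnStats(row_count=1000, distinct_count=50,
                        min_value=0, max_value=100)
    filtered = estimate_less_than(price, 25)
    print(estimate_equals(price), filtered, pick_build_side(filtered, 1000))
```

Without statistics, an optimizer has to guess at sizes like these. With them, it can pick the cheaper plan based on estimated row counts.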

This post provides a brief introduction to running an Apache Flink streaming job on Mesos using DC/OS.

https://mesosphere.com/blog/stream-processing-apache-flink/

The AWS blog has a tutorial on using AWS CodePipeline, AWS Service Catalog, and AWS CodeBuild to run a CI/CD setup for an Apache Spark job.

https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/

This post shows how to use Amazon IoT with Amazon Kinesis and Apache Spark to build a streaming IoT application. It includes the necessary AWS configs as well as sample Spark code.

http://www.qubole.com/blog/iot-with-amazon-kinesis-spark-streaming-on-qubole/

News

The new O'Reilly book on Streaming Systems, authored by several Google software engineers, is in early release. The Google Cloud Platform blog has a Q&A with one of the authors, Tyler Akidau.

https://cloud.google.com/blog/big-data/2017/08/the-canonical-new-book-about-stream-processing

Flink Forward is just a week away—it takes place in Berlin September 11-13. Among others, Netflix, Alibaba, and ING are presenting.

https://berlin.flink-forward.org/

Releases

Apache Samza, the stream processing system, announced version 0.13.1. The release includes a few enhancements and bug fixes covering 29 JIRA tickets.

https://blogs.apache.org/samza/entry/announcing-the-release-of-apache2

Microsoft Azure has announced availability of the Hortonworks Cloudbreak service for provisioning Hortonworks Data Platform clusters. A single Cloudbreak Controller VM can manage multiple clusters and automatically configure both Kerberos and Apache Knox to secure the cluster. Cloudbreak is available via the Azure Marketplace.

https://azure.microsoft.com/en-us/blog/hortonworks-extends-iaas-offering-on-azure-with-cloudbreak/

Cask has announced version 4.3 of CDAP. There is a lengthy overview of the new features, which include improvements for data preparation and ETL, Apache Ranger integration, and Spark Dataframe support.

http://blog.cask.co/2017/08/announcing-ga-release-of-cdap-4-3/

MapR has announced the new MapR Orbit Cloud Suite which provides cross-cloud functionality (combinations of public and private clouds), object-tiering (which can offload certain data to cloud object storage), and cloud-native management (provisioning of VMs in AWS and Microsoft Azure).

https://community.mapr.com/community/products/blog/2017/08/29/introducing-the-mapr-orbit-cloud-suite

StreamSets 2.7.1.0 adds new support for Microsoft Azure, in addition to fixes and other improvements.

https://streamsets.com/blog/announcing-data-collector-v2-7-1-0/

Kafka Lenses is a new suite of enterprise tools for Kafka. It includes a web inspector for Kafka SQL, Kafka Connectors, and more. It also provides operational insights into Kafka clusters, such as exploring partition offsets and managing consumers.

http://www.landoop.com/blog/2017/08/kafka-lenses/
http://kafka-lenses.io/

Given that Kafka got its start at LinkedIn, it's no surprise that they have a lot of clusters and some great tooling for those clusters. This week, they open sourced Cruise Control, which is a system for monitoring and tuning Kafka clusters. For example, it can heal a cluster after a broker fails, decommission brokers, and keep clusters in balance with respect to disk, network, and CPU. The introductory blog post describes design goals and future work.

https://engineering.linkedin.com/blog/2017/08/open-sourcing-kafka-cruise-control
https://github.com/linkedin/cruise-control

On the heels of its graduation from the Apache Incubator, Apache MADlib has released version 1.12. The new release of the machine learning on SQL library adds a number of graph algorithms, includes improvements to the decision tree and random forest implementations, and has better support for summary and sketch calculations.

https://lists.apache.org/thread.html/eb4a773acc90a68a4af306e106670713f6e105bcd2b5dff520391604@%3Cannounce.apache.org%3E

Apache Atlas 0.8.1 was released.

https://lists.apache.org/thread.html/82337a63dd216dbfa4f4609f76ceaef30de79e68dcbf726a673539b9@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Kafka Streams: Are Your Streams Keeping Up? Monitoring for a Streaming World (San Francisco) - Tuesday, September 5
https://www.meetup.com/San-Francisco-Bay-Area-Big-Data-and-Scalable-Systems/events/242590881/

Bay Area Apache Spark Meetup (Santa Clara) - Thursday, September 7
https://www.meetup.com/spark-users/events/242368409/

Apache Spark and Apache Ignite: Where Fast Data Meets the IoT (Santa Clara) - Saturday, September 9
https://www.meetup.com/datariders/events/242523245/

Missouri

Back to School: Hadoop 101 (Saint Louis) - Wednesday, September 6
https://www.meetup.com/St-Louis-Hadoop-Users-Group/events/238931748/

BRAZIL

Code, Beer, Repeat v3.0: Kafka and Xamarin (Sao Paulo) - Tuesday, September 5
https://www.meetup.com/code-beer-repeat/events/242542709/

GERMANY

Jay Kreps and Kai Waehner Talk about Kafka (Munich) - Thursday, September 7
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/242268741/

Apache Flink Meetup (Berlin) - Thursday, September 7
https://www.meetup.com/Apache-Flink-Meetup/events/242278974/

AUSTRIA

Hadoop User Group Meetup (Vienna) - Wednesday, September 6
https://www.meetup.com/Hadoop-User-Group-Vienna/events/242215692/

POLAND

New Generation Integration with NiFi, Kylo + Spark SQL Internals (Krakow) - Wednesday, September 6
https://www.meetup.com/datakrk/events/242823908/

SINGAPORE

Discussion and Open Space for Emerging Big Data and Analytics Technologies (Singapore) - Friday, September 8
https://www.meetup.com/AnalyticsTech/events/241398814/