Data Eng Weekly


Hadoop Weekly Issue #131

26 July 2015

This week's theme is new projects: Cloudera announced Ibis for Hadoop-scale data science with Python, Spree provides an alternative web UI for Spark, and Astro adds HBase support to Spark SQL. In addition, Hortonworks released HDP 2.3, and Cassandra 2.2 adds a number of interesting new features. There are also several articles that provide excellent guidance for folks starting to build out systems in Spark, Kafka, and Flink.

Technical

This post describes using HyperLogLog (HLL) for estimating things like click-through rates for ads. To run at scale, the post describes how to use HLL from Spark, including how to build a custom Spark aggregation function. Finally, it describes a system that uses a pre-built/cached SparkContext to power a REST API server to run Spark queries with 1-2s response times across 100GB of data.

http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/
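
For readers new to HLL, here is a minimal pure-Python sketch of the idea. It is not the post's code and does not use Spark. The class name, the choice of SHA-1 as the hash, and the register count are all assumptions made for illustration. The `merge` step is the part that matters for a distributed aggregation: partial sketches built on separate partitions can be combined by taking the element-wise maximum of their registers.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (hypothetical helper, not the post's implementation)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash from the first 8 bytes of SHA-1 (an arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Combining two sketches is an element-wise max of their registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4096 registers the expected relative error is roughly 1.6%, so two sketches built over overlapping sets of user IDs and then merged should estimate the size of the union closely without keeping the IDs themselves.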

This post describes the tradeoff between throughput and latency in the Kafka Producer and how to tweak the various producer settings to achieve desired performance.

http://ingest.tips/2015/07/19/tips-for-improving-performance-of-kafka-producer/
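
As a rough illustration of the tradeoff, the sketch below writes out two hypothetical producer profiles using standard Kafka producer property names (`acks`, `batch.size`, `linger.ms`, `compression.type`). Which of these the post actually covers is not stated here, and the values are placeholders rather than recommendations. The helper name `to_properties` is also made up for this example.

```python
# Illustrative values only; measure against your own workload before adopting any.

# Larger batches and a short wait before sending favor throughput.
THROUGHPUT_PROFILE = {
    "acks": "1",
    "batch.size": 65536,
    "linger.ms": 50,
    "compression.type": "snappy",
}

# Sending immediately with small batches favors latency.
LATENCY_PROFILE = {
    "acks": "1",
    "batch.size": 16384,
    "linger.ms": 0,
    "compression.type": "none",
}


def to_properties(config):
    """Render a config dict as Java .properties text, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


if __name__ == "__main__":
    print(to_properties(THROUGHPUT_PROFILE))
```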

The MapR blog has a great introduction to Hive's new support for transactions. The post describes some use cases, the supported semantics, how to enable transaction support, and a brief overview of how it works (including how compactions clean up delta files).

https://www.mapr.com/blog/hive-transaction-feature-hive-10
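
The base-plus-delta idea can be pictured with a toy model. In the sketch below, plain Python dicts and lists stand in for base and delta files. The function names and event format are hypothetical and bear no relation to Hive's on-disk layout. The point is only that a reader merges deltas over the base, and that compaction reduces the number of deltas without changing what a reader sees.

```python
def read_table(base, deltas):
    """Merge-on-read: apply deltas (oldest first) on top of the base snapshot."""
    rows = dict(base)
    for delta in deltas:
        for op, key, value in delta:
            if op == "delete":
                rows.pop(key, None)
            else:                       # "insert" or "update"
                rows[key] = value
    return rows


def minor_compact(deltas):
    """Combine many small deltas into a single delta, preserving event order."""
    return [[event for delta in deltas for event in delta]]


def major_compact(base, deltas):
    """Fold every delta into a new base, leaving no deltas behind."""
    return read_table(base, deltas), []


base = {1: "alice", 2: "bob"}
deltas = [
    [("insert", 3, "carol")],
    [("update", 2, "robert")],
    [("delete", 1, None)],
]
```

Reading the table before compaction, after `minor_compact`, and after `major_compact` gives the same rows in this model. Only the number of pieces a reader has to merge changes.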

Cloudera has announced a new project called Ibis, which has the goal of marrying Python data analysis/science libraries with Cloudera Impala to build a fast, native distributed framework. The project will include LLVM integration for code generation from Python code. The Cloudera blog has two posts about it: the first announces the project and describes its goals, and the second includes getting started instructions and more on contributing.

http://blog.cloudera.com/blog/2015/07/ibis-on-impala-python-at-scale-for-data-science/
http://blog.cloudera.com/blog/2015/07/getting-started-with-ibis-and-how-to-contribute/

With the caveat (as always) that benchmarks should be taken with a grain of salt, Pivotal has posted part 1 of a comparison of HAWQ (their SQL-on-Hadoop) to Hive and Impala. The tests were performed on a 15-node cluster with the 30TB TPC-DS dataset. Among the highlights, HAWQ shows speed improvements over Hive and Impala, and it has full support for all of the TPC-DS queries (whereas Hive and Impala do not).

http://blog.pivotal.io/big-data-pivotal/products/performance-benchmark-pivotal-hawq-beats-impala-apache-hive-part-1

This post from the MapR blog compares MapReduce and Spark. It looks at the difference between the execution models, the differences in expressiveness (the Spark API having many more high-level operations like join and group by), and the library support (Spark includes machine learning, graph programming, and SQL as part of the core release).

https://www.mapr.com/blog/5-minute-guide-understanding-significance-apache-spark
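
To give a feel for the difference in expressiveness, here is a word count written two ways in plain Python. Neither half uses the real frameworks. The first spells out explicit map, shuffle, and reduce phases. The second uses a tiny stand-in class, `LocalDataset`, whose method names are hypothetical and only loosely modeled on the chained, high-level style of Spark's API.

```python
from collections import defaultdict

LINES = ["spark and hadoop", "hadoop and hive"]


# Style 1: explicit map, shuffle, and reduce phases.
def map_phase(line):
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    return key, sum(values)


def word_count_phases(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Style 2: a chain of high-level operations on a local stand-in collection.
class LocalDataset:
    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        return LocalDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        return LocalDataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return LocalDataset(acc.items())

    def collect(self):
        return list(self.items)


def word_count_chained(lines):
    return dict(
        LocalDataset(lines)
        .flat_map(str.split)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )
```

Both functions return the same counts. The second simply hides the grouping step behind one named operation.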

This interview with some folks from dataArtisans about Apache Flink has some interesting details about the project. Topics include how it compares to Spark/Samza/Storm, Flink's approach to iterative processing, data streaming in Flink, and the Flink roadmap.

https://www.smaato.com/big-data-nosql-meetup-hamburg-with-apache-flink-at-smaato/

This post describes several parameters for making the best use of resources on a YARN cluster, including how to debug them. In addition to the basic memory settings, the post covers the virtual and physical memory checker, several common exceptions (and their solutions), and a few MapR-specific settings.

https://www.mapr.com/blog/best-practices-yarn-resource-management
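
The arithmetic behind the basic memory settings can be sketched in a few lines. This is a simplified model, not the scheduler's actual code. It assumes that requests are rounded up to a multiple of `yarn.scheduler.minimum-allocation-mb`, that requests above `yarn.scheduler.maximum-allocation-mb` are rejected, and that the virtual memory checker allows `yarn.nodemanager.vmem-pmem-ratio` times the container's physical allocation. The function names and default values are illustrative, so check your own cluster's configuration.

```python
import math


def normalized_request_mb(requested_mb, min_alloc_mb=1024, max_alloc_mb=8192):
    """Round a container request up to a multiple of the scheduler minimum.

    Simplified model: requests above the maximum are rejected outright.
    """
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    return max(1, math.ceil(requested_mb / min_alloc_mb)) * min_alloc_mb


def vmem_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory ceiling under the assumed vmem-pmem ratio."""
    return container_mb * vmem_pmem_ratio


def containers_per_node(node_mb, container_mb):
    """How many containers of one size fit in yarn.nodemanager.resource.memory-mb."""
    return node_mb // container_mb
```

Under these assumptions, a 1500 MB request becomes a 2048 MB container with a virtual memory ceiling of about 4300 MB, and a node offering 65536 MB to YARN fits 32 such containers.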

News

In comparison to the Spark vs. MapReduce article above, this post on Hadoop and Spark is aimed at a broader, non-technical audience. It describes the key differences between the two compute frameworks without getting too far into the implementation details.

https://www.linkedin.com/pulse/big-data-question-hadoop-spark-bernard-marr

The Wrangle Conference is a new event for data scientists taking place in San Francisco on October 22nd.

http://blog.cloudera.com/blog/2015/07/the-new-wrangle-conference-solving-the-hardest-data-science-challenges-from-startup-to-enterprise/

With another caveat about vendor benchmarks, ZDNet has a summary of a recent VMware benchmark that shows Hadoop in VMware is just as fast as bare-metal, even when running 4 VMs per instance.

http://www.zdnet.com/article/virtualized-hadoop-a-brief-look-at-the-possibility/

InfoWorld has an article about Hadoop at Yahoo. Topics include scale (43K servers in 20 YARN clusters, 600PB of storage, and 33 million jobs per month) and which systems they use (HBase, Spark, Oozie, Pig, Storm, Tez, Hive). Yahoo's Hadoop usage is over half Pig and only a small percentage (1%) Spark. Interestingly, though, Yahoo expects to be off of MapReduce by the end of the year (in favor of Tez or Spark).

http://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

Releases

Apache Samza 0.9.1 was released on the 13th. This is a bug-fix release that resolves 7 tickets.

https://blogs.apache.org/samza/entry/announcing_the_release_of_apache3

Apache Cassandra 2.2.0 was released this week. This release adds Windows as a supported platform, support for roles in authentication and authorization APIs, off-heap row-cache, and much more. The NEWS.txt in the announcement includes upgrade instructions.

http://www.mail-archive.com/user@cassandra.apache.org/msg43274.html

Version 2.3 of the Hortonworks Data Platform was released this week. The release contains new versions of nearly all components and adds Atlas and Cloudbreak to the list of supported projects. The Hortonworks blog has highlights of the new features in Spark, Hive, stream processing, and more.

http://hortonworks.com/blog/available-now-hdp-2-3/
http://hortonworks.com/blog/introducing-availability-of-hdp-2-3-part-2/

Astro is an open-source project from Huawei that adds support for HBase to Spark SQL. Astro requires Spark 1.4.0.

http://pr.huawei.com/en/news/hw-445312-astro.htm
https://github.com/Huawei-Spark/Spark-SQL-on-HBase

Spree is a new project that provides an alternative, auto-updating web UI for Spark. A custom DAGScheduler listener forwards information about individual tasks from Spark to slim, a Node.js server that writes to Mongo. Spree uses the Meteor web framework and React to auto-update the UI.

http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Real-time Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames (San Francisco) - Tuesday, July 28
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/223763502/

Spark and Verizon (San Jose) - Tuesday, July 28
http://www.meetup.com/spark-users/events/223901803/

Apache Ambari and Its Role in the Open Data Platform (Palo Alto) - Wednesday, July 29
http://www.meetup.com/Open-Data-Platform-Group/events/223776200/

Twitter Heron: Stream Processing at Scale (San Francisco) - Thursday, July 30
http://www.meetup.com/streams/events/223606092/

The Future of Data, with Doug Cutting (Sunnyvale) - Thursday, July 30
http://www.meetup.com/Bay-Area-Cloudera-User-Group/events/223110554/

Minnesota

Spark 101 (Saint Paul) - Thursday, July 30
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/223810622/

Illinois

Reactive Stream Processing with Kafka-Rx (Chicago) - Tuesday, July 28
http://www.meetup.com/Chicago-Spark-Users/events/223743059/

Wisconsin

Hands-on Demonstration of Apache Drill (Madison) - Tuesday, July 28
http://www.meetup.com/BigDataMadison/events/220143068/

Ohio

Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, July 27
http://www.meetup.com/Cleveland-Hadoop/events/222813402/

Florida

Apache Spark: What Is All the Hype About? (Saint Petersburg) - Wednesday, July 29
http://www.meetup.com/Tampa-Hadoop-Meetup-Group/events/222808783/

North Carolina

F#/Analytics: Much Ado about Hadoop (Cary) - Tuesday, July 28
http://www.meetup.com/TRINUG/events/222701746/

Best Practices on Building a Hadoop Data Lake Solution (Charlotte) - Wednesday, July 29
http://www.meetup.com/CharlotteHUG/events/219153228/

New York

Double Feature: Hadoop & Java 8 Stream Debugging (New York) - Monday, July 27
http://www.meetup.com/nycjava/events/222886516/

COLOMBIA

Data Crunching with Spark + Data Warehouses in Business (Bogota) - Thursday, July 30
http://www.meetup.com/Big-Data-Science-Bogota/events/223541361/

UNITED KINGDOM

Real-time Stream Processing with Batch Analytics (Manchester) - Wednesday, July 29
http://www.meetup.com/HadoopManchester/events/223625623/

GERMANY

Flink 0.10 & Comparing Flink to Other Streaming Systems (Berlin) - Wednesday, July 29
http://www.meetup.com/Apache-Flink-Meetup/events/223574589/

ISRAEL

Hadoop Ecosystem (Haifa) - Tuesday, July 28
http://www.meetup.com/Haifa-Whats-new-with-BIG-DATA-Meetup/events/223461399/

INDIA

Apache Spark: An Open-Source Revolution Comes to Big Data Analytics (Bangalore) - Friday, July 31
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/223831548/

A Deep Dive into Spark Dataframe API (Bangalore) - Saturday, August 1
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/223706897/

AUSTRALIA

Spark 1.4 Announcement + Spark Summit Recap + Tableau Spark Driver (Sydney) - Monday, July 27
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/223453539/