Data Eng Weekly


Hadoop Weekly Issue #135

23 August 2015

As more companies adopt Spark, several are using it without a Hadoop cluster. An article in this week's issue describes how Grammarly runs Spark on EC2 (with S3 for storage), and Mesosphere has announced a new product that marries Spark, Mesos, Cassandra, and more (but no HDFS). On the topic of Mesos, there are also articles this week on Chronos (scheduler for Mesos) and running Mesos on a CDH cluster. In addition to these, there are plenty of other great links this week, such as the slides describing the Hadoop shell script rewrite and the BigTop 1.0.0 release announcement.

Technical

The latest of the Call me Maybe series looks at Chronos, which is a task scheduler for Mesos clusters. The series evaluates distributed systems in the presence of failure using the Jepsen framework. Even for folks without a lot of background knowledge of Mesos or Chronos, it's easy to follow the post and understand the results.

https://aphyr.com/posts/326-call-me-maybe-chronos

The team at Grammarly recently switched from Amazon EMR with Pig to Apache Spark (using the spark-ec2 scripts). This post describes their migration and a number of fixes to issues that they ran into while deploying Spark on EC2. Several of the fixes relate to the S3 Hadoop FileSystem implementation, but there are also a few related to memory management. Overall, it's a fantastic overview for anyone running Spark... but particularly so if you're on EC2.

http://tech.grammarly.com/blog/posts/Petabyte-Scale-Text-Processing-with-Spark.html

This post is a quick introduction to Apache Gora, which is a framework to serialize in-memory data models for storage in systems like HBase, Cassandra, and MapReduce. The post covers Gora's main components, Gora's features, and the datastores supported by Gora.

http://www.javacodegeeks.com/2015/08/in-memory-data-model-and-persistence-for-big-data.html

This week, the morning paper covered articles on Google Cloud Dataflow, Apache Flink streaming, and Google's Millwheel stream processing system. As usual, each post has an extended summary of a paper, which provides enough background to understand the key contributions of the paper.

http://blog.acolyer.org/2015/08/18/the-dataflow-model-a-practical-approach-to-balancing-correctness-latency-and-cost-in-massive-scale-unbounded-out-of-order-data-processing/
http://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
http://blog.acolyer.org/2015/08/21/millwheel-fault-tolerant-stream-processing-at-internet-scale/

Hive has a few different types of User Defined Functions, which have been covered in a series on the Beekeeper Data blog. This is the third post in the series, and it covers the User Defined Aggregate Function (UDAF) version of the API. For the walkthrough, the post implements a method to sum and count the total number of letters in a column of strings. The are a few subtleties of the API, which are explained alongside example code.

http://beekeeperdata.com/posts/hadoop/2015/08/17/hive-udaf-tutorial.html

This post describes a number of important algorithms and concepts that are often used in streaming algorithms. The topics covered are hashing and sketching, hyper log log, count min sketch, streaming k-means, and quantiles via t-digest. Each of the concepts is explained both in writing and graphically.

https://www.mapr.com/blog/some-important-streaming-algorithms-you-should-know-about

The Cloudera blog has a guest post from Big Industries that details how to run Apache Mesos using Cloudera Manager. To achieve this, Big Industries has built a Custom Service Descriptor and published a parcel repository. After configuring Cloudera Manager for these, it's fairly easy to launch an application from a docker image via the Mesos/Marathon integration. All of the source code is available on github.

http://blog.cloudera.com/blog/2015/08/how-to-run-apache-mesos-on-cdh/

Spotify has been moving a number of their big data applications from python to Scala. This presentation describes the tools and libraries (such as scalding, annoy, and bloom filters) behind recommendations and other big data applications at Spotify.

http://www.slideshare.net/sinisalyh/scala-data-pipelines-spotify

If you've ever battled with the Hadoop shell code, you're probably familiar with how difficult and obtuse it can be (even getting the help text has been famously difficult). This presentation describes the rewrite of the Hadoop shell code which was committed to trunk just over a year ago (but hasn't yet been part of a release). The rewrite unifies a lot of functionality/scripts, improves usability and operability, and adds a number of new features like .hadooprc for client config, the ability to customize internal functions (like log rotation), and support for starting secure daemons as non-root users.

https://speakerdeck.com/allenwittenauer/apache-hadoop-shell-rewrite

The Hortonworks blog has a post on fault tolerance for Nimbus, which is the daemon responsible for managing topologies in an Apache Storm cluster. The post describes the active-standby Nimbus architecture across the areas of leader election, state storage, and leader discovery.

http://hortonworks.com/blog/fault-tolerant-nimbus-in-apache-storm/

With Apache Drill in standalone mode, it's incredibly easy to convert a CSV file to another format like Apache Parquet. This walkthrough describes how to do just that—it's a simple query that scales with the number of columns in the input CSV dataset.

https://www.mapr.com/blog/how-convert-csv-file-apache-parquet-using-apache-drill

Hue is a web interface for Hadoop, and it's recently gained support for Spark. The Spark application includes interactive querying with Scala, Python, and R using the Livy job submission server. The Hue team has published slides about the architecture of the new integration, which is in its third iteration.

http://gethue.com/big-data-scala-by-the-bay-interactive-spark-in-your-browser/

News

This post has an interesting visualization of all of Hadoop releases over time. From the graphics, it's easy to see a few pattern. For example, release velocity slowed in 2009, and there were some big lapses in between releases (e.g. 0.21.0 to 0.20.203.0).

http://www.allanalytics.com/author.asp?doc_id=278430&section_id=3619

This post articulates concerns about the community of some Hadoop-related projects, which are dominated by a single vendor.

http://www.zdnet.com/article/hadoop-veteran-ted-dunning-when-open-source-is-anything-but-open/

MapR has been available on Amazon Web Services via Amazon EMR for some time, but this week MapR added the ability to deploy via the AWS Marketplace. Using the Marketplace application, it's possible to start a cluster from pre-built Amazon Machine Images and take advantages of features not supported in EMR.

https://awsinsider.net/articles/2015/08/19/mapr-hadoop.aspx

Forbes has a look at Confluent, the commercial company behind Apache Kafka. The article talks about the origins of the technology at LinkedIn, some of the companies using Kafka, several Confluent customers, and the next release of the Confluent Platform (scheduled for October).

http://www.forbes.com/sites/alexkonrad/2015/08/19/why-confluent-data-has-silicon-valley-buzzing/

An article on ZDNet highlights "five open-source big data projects to watch." There's a quick blurb on each of them: Apache Flink, Apache Samza, Ibis, Apache Twill (incubating), and Apache Mahout.

http://www.zdnet.com/article/five-open-source-big-data-projects-to-watch/

Spark Summit EU is October 27-29 in Amsterdam. Day 1 is training, and days 2 and 3 are presentations.

https://spark-summit.org/eu-2015/

Releases

Apache Bigtop 1.0.0 was released this week. The release is based on Hadoop 2.6.0, includes HBase 0.98.12, Ignite 1.2.0, Spark 1.3.1, and much more.

http://mail-archives.apache.org/mod_mbox/bigtop-user/201508.mbox/%3C20150817185327.GT12957%40boudnik.org%3E

Mesosphere Infinity is a new product focussed at stream processing and big data applications. It's powered by Mesos, Spark, Kafka, Cassandra and Akka. Notably, the system uses Cassandra for persistence rather than HDFS.

http://www.infoworld.com/article/2973608/big-data/mesospheres-new-big-data-solution-add-spark-hold-the-hadoop.html

Cloudera Enterprise 5.4.5 was released this week. It includes fixes for Impala, Parquet support, MapReduce, and more. There are also several fixes to Cloudera Manager.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-4-5/m-p/30895#U30895

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Why Understanding How Spark Is Built Helps the Community (San Francisco) - Tuesday, August 25
http://www.meetup.com/spark-users/events/223632544/

Spark and Hadoop in Production: Real Life Stories (San Francisco) - Wednesday, August 26
http://www.meetup.com/Pepperdata/events/224407137/

Apache Ignite: In-Memory Data Fabric (Oakland) - Wednesday, August 26
http://www.meetup.com/eastbayjug/events/223440590/

Come Learn about PigPen: Data Engineering with Clojure and Hadoop (Los Gatos) - Thursday, August 27
http://www.meetup.com/cascading/events/224050557/

Distributed Stream and Graph Processing with Apache Flink (San Jose) - Thursday, August 27
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/

Texas

What's New with Hortonworks 2.3, Product Review and IoT Demo (Arlington) - Tuesday, August 25
http://www.meetup.com/DFW-Analytics-Big-Data-and-Beyond/events/224423660/

Minnesota

Apache Drill (Saint Paul) - Thursday, August 27
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/224534295/

North Carolina

Overview of the Upcoming Apache Atlas Project (Charlotte) - Wednesday, August 26
http://www.meetup.com/CharlotteHUG/events/219153236/

Virginia

Is Apache Spark Ready for the Cloud? (Vienna) - Thursday, August 27
http://www.meetup.com/bigdatadc/events/223863127/

COLOMBIA

Hadoop in Industry + Example of Akka and Spark (Bogota) - Thursday, August 27
http://www.meetup.com/Big-Data-Science-Bogota/events/224560500/

GERMANY

Flink Meetup #10: Cluster Deployment on Docker, Flink/Gelly, Community Update (Berlin) - Wednesday, August 26
http://www.meetup.com/Apache-Flink-Meetup/events/223913365/

POLAND

End of Chaos in the Hadoop Cluster (Warsaw) - Wednesday, August 26
http://www.meetup.com/warsaw-hug/events/224501165/

EGYPT

Hands on with Apache Spark (Cairo) - Saturday, August 29
http://www.meetup.com/Cairo-Data-Science-Meetup/events/224737198/

CHINA

Spark Meetup (Shanghai) - Saturday, August 29
http://www.meetup.com/Shanghai-Apache-Spark-Meetup/events/224628824/

Data Technology Meetup #2 (Nanjing) - Sunday, August 30
http://www.meetup.com/Nanjing-Big-Data-Processing-Technology-Meetup/events/224756894/

TAIWAN

MLDM Monday: Big Data Series (Taipei) - Monday, August 24
http://www.meetup.com/Taiwan-R/events/222962960/

AUSTRALIA

Apache Spark 101 and Smart Cities and Big Data (Sydney) - Monday, August 24
http://www.meetup.com/Sydney-Big-Data-and-Analytics/events/224418400/