Data Eng Weekly


Hadoop Weekly Issue #171

22 May 2016

There were quite a few releases this week, including a new open-source project from LinkedIn. On the technical and news front, there are several articles recapping Apache: Big Data North America, and there's an excellent series of posts about analyzing NYC Taxi data across several different data systems.

Technical

The Databricks blog has a post about two approximation algorithms that are available in Apache Spark. They are approxCountDistinct, which estimates the number of distinct values, and approxQuantile, which generates approximate percentiles. The post describes the algorithms and visualizes the accuracy for varying residuals.

https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
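For a feel of how a distinct count can be estimated in bounded memory, here is a minimal sketch in plain Python. It is not Spark's implementation and it is not HyperLogLog: it uses the simpler k-minimum-values technique as a stand-in. The function names, the MD5-based hash, and the default k are all assumptions made for illustration.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-uniform float in the unit interval (MD5-based, illustrative)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k=256):
    """Estimate the number of distinct values by tracking only the k smallest hashes."""
    heap = []        # max-heap (values negated) holding the k smallest hashes seen
    members = set()  # hashes currently in the heap, so duplicates are skipped
    for value in values:
        h = unit_hash(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest one kept: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct values were seen, so the count is exact.
        return float(len(heap))
    # If n hashes are spread uniformly, the k-th smallest sits near k / n.
    return (k - 1) / (-heap[0])


if __name__ == "__main__":
    sample = [i % 10_000 for i in range(50_000)]  # 10,000 distinct values
    print(round(kmv_estimate(sample)))
```

Memory stays proportional to k rather than to the number of distinct values, and the relative error shrinks roughly as 1/sqrt(k). That is the same accuracy-for-space trade-off the post explores for Spark's own estimators.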

This tutorial describes how to use Apache Hadoop HDFS, Apache Solr, and Hue to store, index, and search for medical images stored in the DICOM format. The post includes a walkthrough of the steps needed to load and fetch the data.

http://blog.cloudera.com/blog/2016/05/how-to-process-and-index-medical-images-with-apache-hadoop-and-apache-solr/

MapR Streams is a system that is API compatible with Apache Kafka. This post describes, at a high level, the similarities and differences between MapR Streams and Kafka. There's also a clarification of how Kafka Streams relates to MapR Streams.

https://www.mapr.com/blog/apache-kafka-and-mapr-streams-terms-techniques-and-new-designs

This post is one of the clearest explanations of Paxos, the consensus protocol for distributed systems, that I've seen. The article includes examples of plotting computers and distributed auctions to help illustrate the protocol.

http://ifeanyi.co/posts/understanding-consensus/

Based on a presentation at the recent Apache: Big Data North America, Datanami has a look at the new features in the upcoming Apache Hadoop 3 release. Among the highlights are the shell script rewrite, task-level native optimization, the capability to derive memory sizes automatically, and support for erasure coding in HDFS. The post looks closely at erasure coding, which should improve storage efficiency (1.5x disk consumption rather than 3x).

http://www.datanami.com/2016/05/18/hadoop-3-poised-boost-storage-capacity-resilience-erasure-coding/
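The 1.5x versus 3x figures fall out of simple arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks; that layout is an assumption for illustration and is not stated in the summary above. The function name is hypothetical.

```python
def storage_overhead(data_units, parity_units):
    """Raw disk consumed per unit of logical data for an erasure-coded layout."""
    return (data_units + parity_units) / data_units


# Three-way replication stores every block three times.
replication_overhead = 3.0

# Assumed layout: 6 data blocks plus 3 parity blocks per stripe.
erasure_overhead = storage_overhead(6, 3)

logical_tb = 100  # hypothetical dataset size in TB
replicated_tb = logical_tb * replication_overhead
erasure_coded_tb = logical_tb * erasure_overhead
```

Under that assumed layout, each stripe writes 9 blocks for every 6 blocks of data, which gives the 1.5x figure, and the stripe can still be rebuilt after losing any 3 of its blocks.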

This presentation from PyData Berlin describes a future in which Apache Arrow and the Feather file format are the main mechanisms for data interoperability across languages and frameworks.

http://www.slideshare.net/wesm/python-data-ecosystem-thoughts-on-building-for-the-future

Videos of two Apache Kafka-related talks from two separate conferences have been posted. The first describes the new security features in Kafka, and the second explores using Kafka to share data across systems.

https://www.oreilly.com/learning/securing-apache-kafka
https://www.infoq.com/presentations/event-streams-kafka

This blog has a collection of posts about loading/querying the New York City taxi data via various data systems like Amazon Redshift, Google BigQuery, Postgres, and Presto. In addition to raw benchmarking, there are details about troubleshooting, optimizations, and comparing alternatives (such as S3 vs HDFS in AWS).

http://tech.marksblogg.com/all-billion-nyc-taxi-rides-redshift.html

O'Reilly has an article describing how to implement the kappa architecture with Kafka, Flink, Elasticsearch, and Kibana. The post gives an overview of the lambda and kappa architectures, covers the major architecture components, and shows how to use the setup to detect novelties using Bayesian models.

https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry

News

This post about the recent Apache: Big Data North America conference enumerates many of the big data ecosystem projects that were covered at the conference. There are quite a few, including several that weren't yet on my radar.

http://www.datanami.com/2016/05/11/open-source-tour-de-force-apache-big-data-2016/

The Pivotal blog has an interesting post on big data and agile development. Big data systems are often stuck in a non-agile world in which requirements are gathered and schemas are defined well before data is pulled in. The post argues that the constraints that necessitate this approach (limited capacity and performance, siloed data, etc.) are no longer valid in a cloud-based environment.

https://blog.pivotal.io/big-data-pivotal/features/when-it-comes-to-big-data-cloud-and-agility-go-hand-in-hand

Databricks has published a recording of their webinar "Apache Spark MLlib: From Quick Start to Scikit-Learn" for on-demand viewing. In addition to the webinar content, they've posted answers to eight common questions from the session.

https://databricks.com/blog/2016/05/18/spark-mllib-from-quick-start-to-scikit-learn.html

The Hortonworks blog has a post giving an overview of the history of Apache Storm. Open-sourced in 2011, Storm moved to the Apache incubator in 2013, became a top-level project in 2014, and hit its 1.0 release earlier this year. The article discusses the major technical advances in each of those milestones and more.

http://hortonworks.com/blog/brief-history-apache-storm/

HBaseCon is this week in San Francisco. The conference includes keynotes from Apple, Yahoo, and Facebook.

http://hbasecon.com

MapR has an infographic celebrating the last year of Apache Drill. In that time, the project has had 7 releases and hit a number of impressive milestones.

https://www.mapr.com/blog/happy-anniversary-apache-drill-what-difference-year-makes

Datanami has an article covering a Q&A at Apache: Big Data North America with ASF director Jim Jagielski and ODPi program director John Mertic. The main topic, as expected, was the relationship between the ASF and ODPi.

http://www.datanami.com/2016/05/20/apache-foundation-keeps-eyes-wide-open-odpi/

Releases

LinkedIn has open-sourced Ambry, their distributed object store. The code for Ambry is on GitHub, and the introductory blog post has a thorough overview of Ambry's targeted SLAs, design goals, architecture, and interfaces.

https://engineering.linkedin.com/blog/2016/05/introducing-and-open-sourcing-ambry---linkedins-new-distributed-

Pivotal HDB 2.0, which is powered by Apache HAWQ (incubating) and provides an analytics database for Hadoop, was released this week.

https://blog.pivotal.io/big-data-pivotal/products/fail-fast-and-ask-more-questions-of-your-data-with-hdb-2-0

Version 0.12.1 of Apache Mahout, the machine learning and data mining system, was released this week. The release addresses a number of issues with the Flink/Mahout integration.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201605.mbox/%3CCAOtpBjhshagyLN3Qnt0xRnc7YbnMVJjTS4piVXL7LiS2pQguXw@mail.gmail.com%3E

Version 0.11.3 of Apache Tajo, the data warehouse for Hadoop, was released. The new release fixes 5 bugs.

http://tajo.apache.org/releases/0.11.3/announcement.html

MongoDB has announced a new MongoDB Connector for Apache Spark. Compared with the Hadoop InputFormat shim for Spark, this connector adds a number of features. In addition to the announcement, there's another post explaining some of the key features.

https://www.mongodb.com/blog/post/mongodb-connector-for-apache-spark-announcing-early-access-program-and-new-spark-training
http://rosslawley.co.uk/introducing-a-new=mongodb-spark-connector/

SyncSort has released DMX-h v9, which adds support for Kafka and a new Intelligent Execution framework.

http://insidebigdata.com/2016/05/20/syncsorts-latest-innovations-simplify-integration-of-streaming-data-in-spark-kafka-and-hadoop-for-real-time-analytics/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Meetup on the Night before HBaseCon2016! (San Francisco) - Monday, May 23
http://www.meetup.com/hbaseusergroup/events/230547750/

Solr as a SparkSQL DataSource (San Francisco) - Monday, May 23
http://www.meetup.com/Downtown-SF-Apache-Lucene-Solr-Meetup/events/230554530/

PhoenixCon (San Francisco) - Wednesday, May 25
http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/

Storm/Kafka Meetup: “Securing Kafka Clusters” (San Francisco) - Wednesday, May 25
http://www.meetup.com/futureofdata-sanfrancisco/events/230650315/

The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft (Mountain View) - Wednesday, May 25
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/230129339/

Stream Your Operational Data w/ Apache Spark & Kafka into Hadoop Using Couchbase (Santa Monica) - Thursday, May 26
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/231173919/

Washington

Apache Phoenix + More (Seattle) - Wednesday, May 25
http://www.meetup.com/Seattle-Scalability-Meetup/events/229506359/

Spark Streaming Primer and TUNE Case Study (Seattle) - Thursday, May 26
http://www.meetup.com/Seattle-Spark-Meetup/events/230026396/

Texas

Spark Hands-on 1-Day Workshop for Data Engineers, Data Scientists and Developers (Coppell) - Tuesday, May 24
http://www.meetup.com/Big-Data-Developers-in-Dallas/events/230924254/

Cloudy to Clear: Big Data and Insights with Azure (Houston) - Tuesday, May 24
http://www.meetup.com/AzureHouston/events/230744053/

What Is All the Hype about Apache Spark (Coppell) - Tuesday, May 24
http://www.meetup.com/Big-Data-Developers-in-Dallas/events/230748657/

Cloudera User Group Meetup (Plano) - Wednesday, May 25
http://www.meetup.com/DFW-Cloudera-User-Group/events/230547045/

Minnesota

Apache Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data (Saint Paul) - Thursday, May 26
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/230598640/

Illinois

Flinking Even Faster with Iterations and Delta Iterations (Chicago) - Thursday, May 26
http://www.meetup.com/Chicago-Apache-Flink-Meetup/events/231080374/

North Carolina

May CHUG: Cloudera on Kafka (Charlotte) - Wednesday, May 25
http://www.meetup.com/CharlotteHUG/events/227293954/

Virginia

Spark Streaming and the Internet of Things (Arlington) - Tuesday, May 24
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/230931553/

New Jersey

Interactive Real-Time Streaming with Spark 2.0: Structured Streaming (Princeton, NJ 08544) - Wednesday, May 25
http://www.meetup.com/nj-datascience/events/230869222/

New York

KNN with Apache Flink by the Implementor, Dan Blazevski (New York) - Tuesday, May 24
http://www.meetup.com/ny-scala/events/231163636/

Massachusetts

Special Spark Presentation Night (Somerville) - Tuesday, May 24
http://www.meetup.com/Boston-Apache-Spark-User-Group/events/231190031/

CANADA

Toronto Apache Spark #9 (Toronto) - Wednesday, May 25
http://www.meetup.com/Toronto-Apache-Spark/events/230677223/

UNITED KINGDOM

Python for Data Engineers and How to Blend the Database World with Apache Spark (London) - Tuesday, May 24
http://www.meetup.com/Data-Science-Festival-London/events/230711201/

GERMANY

Apache Flink Meetup Berlin #14 (Berlin) - Tuesday, May 24
http://www.meetup.com/Apache-Flink-Meetup/events/231093625/

CZECH REPUBLIC

HBase and MySQL Ecosystem for Real-Time Views of Data (Prague) - Thursday, May 26
http://www.meetup.com/CS-HUG/events/230835837/

HUNGARY

DataFrames and Spark SQL in Network Analytics (Budapest) - Wednesday, May 25
http://www.meetup.com/Budapest-Spark-Meetup/events/230682817/

GREECE

Big Data Meetup: Apache Storm, Backgammon AI Agents (Athens) - Tuesday, May 24
http://www.meetup.com/Athens-Big-Data/events/230967818/

UNITED ARAB EMIRATES

Hortonworks Data Platform: International Speakers (Dubai) - Monday, May 23
http://www.meetup.com/UAE-Big-Data-Group/events/231157498/

INDIA

Understanding and Building Big Data Architectures, Part 3: Kafka (Hyderabad) - Saturday, May 28
http://www.meetup.com/hyderabad-scalability/events/229886391/

Machine Learning Pipelines with Spark ML (Bangalore) - Saturday, May 28
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/230898948/