Data Eng Weekly


Hadoop Weekly Issue #96

16 November 2014

Big news this week out of Palo Alto as Hortonworks has filed paperwork for an initial public offering. There were also a number of notable releases this week, including Apache Hive 0.14.0. Technical posts cover a large number of ecosystem topics, including Apache Sqoop, Apache Drill, and Apache Pig. There’s a lot of breadth in this issue, so there should be something for everyone!

Technical

The Cloudera blog has a guest post from Cerner about integrating Apache Kafka with HBase and Storm for real-time processing. The post describes how adopting Kafka helped reduce load on HBase (which was previously used for queuing) and improve performance. This style of Kafka-based architecture seems to be more and more common, but it’s always interesting to hear how folks are putting together the pieces of the Hadoop ecosystem.

http://blog.cloudera.com/blog/2014/11/how-cerner-uses-cdh-with-apache-kafka/

The MapR blog has a post on using the recently-released Apache Drill 0.6.0-incubating to analyze Yelp’s public data set. The data, which is a JSON file, can be queried directly via SQL in Drill without first declaring the data’s schema (Drill auto-detects it). The post has a number of sample queries which you can use to get started analyzing this or any other data set.

https://www.mapr.com/blog/how-turn-raw-data-yelp-insights-minutes-apache-drill
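
To make the idea of schema auto-detection concrete, here is a minimal Python sketch that scans newline-delimited JSON and records which fields and value types appear. It is only an illustration of the general idea, not how Drill implements it; the function name infer_schema and the sample records are hypothetical and are not taken from the Yelp data set.

```python
import json
from collections import defaultdict


def infer_schema(json_lines):
    """Return {field_name: set of Python type names seen} for newline-delimited JSON."""
    schema = defaultdict(set)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field, value in record.items():
            # Record every type observed for this field across all records.
            schema[field].add(type(value).__name__)
    return dict(schema)


# Hypothetical records, loosely shaped like business listings (not real Yelp data).
sample = [
    '{"name": "Cafe A", "stars": 4.5, "review_count": 120}',
    '{"name": "Diner B", "stars": 3, "city": "Phoenix"}',
]
schema = infer_schema(sample)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Note that fields can be missing from some records and can carry more than one type, which is the kind of variation a schema-free query engine has to tolerate.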

The Cloudera blog has a second guest post, this time from Dell, on the new Oracle direct mode in Sqoop 1.4.5. The post describes several of the optimizations implemented in the Oracle direct mode and includes an analysis of the performance improvements the connector provides.

http://blog.cloudera.com/blog/2014/11/how-apache-sqoop-1-4-5-improves-oracle-databaseapache-hadoop-integration/

The Hortonworks blog has a post on using Apache Pig with the Python Scikit-learn package in order to predict flight delays using logistic regression and random forests. The post is a bit light on details, but there is a linked IPython notebook which has a very detailed overview and description of the entire process. Given that Python is often a data scientist’s top choice for machine learning on small data sets, it’s useful to see how to extend it to larger data sets with Pig.

http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/

The ingest.tips blog has a post on Sqoop1 support for Parquet, which leverages the Kite SDK to generate Parquet files during import. The post serves as a good introduction to Sqoop1, which can both import data to HDFS and update the Hive metastore with information about the data. There are examples demonstrating how to use Parquet support.

http://ingest.tips/2014/11/10/parquet-support-arriving-in-sqoop/

Tephra is an open-source system that provides globally-consistent transactions for Apache HBase. Cask, the makers of Tephra, have written a blog post describing the requirements and design of Tephra. Tephra is designed in such a way that it can be used with systems other than HBase, and it is even designed to support transactions spanning multiple data stores.

http://blog.cask.co/2014/11/how-we-built-it-designing-a-globally-consistent-transaction-engine/

This presentation focuses on Spark Streaming, the micro-batch component of Apache Spark. The slides give an introduction to both Spark and Spark Streaming, describe several use cases (claiming there are 40+ known production use cases), give an overview of several integrations (Cassandra, Kafka, Elasticsearch, and more), and look ahead to some upcoming features and improvements in the development pipeline.

http://www.slideshare.net/pacoid/tiny-batches-in-the-wine-shiny-new-bits-in-spark-streaming
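
For readers new to the term, micro-batching means chopping a continuous stream into small fixed-width time windows and processing each window as an ordinary batch. The following is a minimal plain-Python sketch of that idea; it does not use the Spark API, and the function name micro_batches and the click events are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time windows.

    Returns a list of (window_start, [values]) sorted by window start.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // batch_seconds) * batch_seconds
        windows[window_start].append(value)
    return sorted(windows.items())


# Hypothetical click events: (seconds since start, page).
events = [(0.2, "home"), (0.9, "search"), (1.1, "home"), (2.5, "cart"), (2.7, "home")]
batches = micro_batches(events, batch_seconds=1)

# Each window is then processed like a small batch job, here a simple count.
counts = [(start, len(values)) for start, values in batches]
print(counts)
```

The batch interval is the main tuning knob in this style of processing: shorter windows lower latency, while longer windows amortize per-batch overhead.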

News

Hortonworks filed paperwork for its initial public offering this week. The filing includes a number of details on the company, including financial numbers ($33.4M in revenue so far in 2014), an overview of key company milestones, and the number of employees (524 at the end of September). GigaOm has an analysis of some of these numbers and an overview of what the IPO means for the rest of the industry.

https://gigaom.com/2014/11/10/hadoop-startup-hortonworks-has-filed-for-an-ipo/
https://gigaom.com/2014/11/10/why-the-hortonworks-could-be-a-bellwether-for-hadoop/

IBM’s Big Data for Social Good Challenge opened this week. The challenge includes $40k in prizes, which will be awarded by a panel composed of IBM and industry experts. IBM has a curated list of datasets which can be used as part of a challenge entry.

https://developer.ibm.com/hadoop/2014/11/10/participate-big-data-social-good-challenge/

Releases

Apache Drill 0.6.0-incubating was recently released. 0.6.0 is the second beta release, primarily containing bug fixes. Notable new features include ANSI SQL support for MongoDB, partition pruning, and (alpha) window function support.

http://mail-archives.apache.org/mod_mbox/incubator-drill-user/201411.mbox/%3CCAA_-67d996Ec22tSgUKQGE-_Ck1FqhLdqbp1dNGZPRD6OGsxuQ%40mail.gmail.com%3E

Cubert is a new open-source tool from LinkedIn for writing high-performance MapReduce jobs. It consists of a new language on the same level as Pig or Hive (sharing some resemblance to Pig) as well as a novel storage format/layer called blocks. For statistical calculations, graph computations, and OLAP cubes, Cubert offers impressive performance improvements. There’s a lot more information in the introductory blog post.

https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics

Apache Hive 0.14.0 was released this week. The release resolves over 1,000 (!) Jira issues. I’m sure we’ll soon hear more details about the release in blog post form, but some quick highlights include: support for insert/update/delete with ACID semantics, a cost-based optimizer, support for data stored in Accumulo, support for HBase snapshots, and many improvements to ORCFile and HiveServer2.

http://mail-archives.apache.org/mod_mbox/hive-user/201411.mbox/%3CCAH93c2ZaxVGtKp72QMiVrQ4d0XKRpEJr9d2t9orT3=z0bQVnOQ@mail.gmail.com%3E

Pivotal Cloud Foundry (CF) has added support for deploying Cassandra via DataStax Enterprise. The blog post introducing the feature has many more details as well as an example of setting up a cluster.

http://blog.pivotal.io/cloud-foundry-pivotal/features/an-easier-way-to-deploy-cassandra-clusters

Version 0.4.1 of the Spark Job Server has been released. The new version supports Spark 1.1.0 and has improvements for deployment/configuration.

https://github.com/spark-jobserver/spark-jobserver/releases/tag/v0.4.1
https://github.com/spark-jobserver/spark-jobserver/blob/v0.4.1/notes/0.4.1.markdown

Microsoft released version 2.5 of the Azure SDK and a preview of Visual Studio 2015. The releases contain support for HDInsight (the Hadoop as a Service component of Azure) including a Hive query editor and job viewer.

http://azure.microsoft.com/blog/2014/11/12/announcing-azure-sdk-2-5-for-net-and-visual-studio-2015-preview/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Data Exploration in Spark (San Francisco) - Tuesday, November 18
http://www.meetup.com/San-Francisco-PyData/events/215142332/

Getting Started with Spark and Scala, by Paul Snively of Verizon OnCue (El Segundo) - Tuesday, November 18
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/207973922/

OCBigData Monthly Meetup #7 (Irvine) - Wednesday, November 19
http://www.meetup.com/OCBigData/events/179381262/

49th Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, November 19
http://www.meetup.com/hadoop/events/152042012/

HBase Meetup @ WANdisco (San Ramon) - Thursday, November 20
http://www.meetup.com/hbaseusergroup/events/205219992/

Washington

Unlocking Your Hadoop Data with Apache Spark and CDH5 (Seattle) - Wednesday, November 19
http://www.meetup.com/Seattle-Spark-Meetup/events/169932382/

Oregon

MapR Presents Apache Drill: Self-Service Data Exploration (Portland) - Wednesday, November 19
http://www.meetup.com/Hadoop-Portland/events/216654112/

Apache Spark: Setup, Overview, and Comparison (Portland) - Wednesday, November 19
http://www.meetup.com/Portland-Data-Science-Workshops/events/215207692/

Kansas

Scalable In-Hadoop ETL Execution: Pentaho's Visual MapReduce (Overland Park) - Wednesday, November 19
http://www.meetup.com/Kansas-City-Data-Engineering-at-Scale/events/217433632/

Missouri

Securing the Hadoop Cluster (Saint Louis) - Tuesday, November 18
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/215019942/

Texas

Hadoop Like a Champion! (Austin) - Tuesday, November 18
http://www.meetup.com/CloudAustin/events/212247982/

Spark and Cassandra: Building and Deploying an Application (Austin) - Thursday, November 20
http://www.meetup.com/Austin-Cassandra-Users/events/211707542/

Utah

Hadoop Lunch at Adobe (Lehi) - Thursday, November 20
http://www.meetup.com/BigDataUtah/events/217120332/

Virginia

Hadoop Tutorial: Map-Reduce on YARN, Part 1 (Sterling) - Saturday, November 22
http://www.meetup.com/The-Sterling-dbuser-Meetup-Group/events/210492652/

Pennsylvania

Understanding the Foundations of Hadoop (Philadelphia) - Tuesday, November 18
http://www.meetup.com/Big-Data-Developers-in-Philadelphia/events/217612702/

North Carolina

Triangle SQL Server UG Meeting (Raleigh) - Tuesday, November 18
http://www.meetup.com/tripass/events/218643575/

Automating Customer Intelligence Management in Hadoop (Charlotte) - Wednesday, November 19
http://www.meetup.com/CharlotteHUG/events/167353212/

When to Use Pig instead of Hive (Winston Salem) - Thursday, November 20
http://www.meetup.com/Triad-Hadoop-Users-Group/events/208153612/

New Jersey

YARN + Docker Containers: Integration and Privilege Isolation (Hamilton Township) - Wednesday, November 19
http://www.meetup.com/nj-hadoop/events/206636262/

New York

Privilege Isolation in Docker Containers (New York) - Thursday, November 20
http://www.meetup.com/Hadoop-NYC/events/207004472/

Massachusetts

SQL on Hadoop: Hands-on (Boston) - Wednesday, November 19
http://www.meetup.com/Big-Data-Developers-in-Boston/events/215125502/

UNITED KINGDOM

November 2014 Hadoop Meetup (London) - Monday, November 17
http://www.meetup.com/hadoop-users-group-uk/events/217791892/

SINGAPORE

Analyzing Real-World Data with Drill, Hadoop & MongoDB | Tomer Shiran, MapR (Singapore) - Monday, November 17
http://www.meetup.com/BigData-Hadoop-SG/events/216571852/

GERMANY

Apache Cassandra, Apache Spark, and Hadoop Meetup (Munich) - Tuesday, November 18
http://www.meetup.com/Big-Data-Developers-in-Munich/events/217571312/

Patrick McFadin Talks C* & Spark for Time Series, plus A Leap Forward for SQL on Hadoop (Berlin) - Wednesday, November 19
http://www.meetup.com/Berlin-Cassandra-Users/events/217584792/

NETHERLANDS

Patrick McFadin Talks Cassandra, Spark, Tips and Tricks (Amsterdam) - Friday, November 21
http://www.meetup.com/Netherlands-Cassandra-Users/events/218615363/

HUNGARY

Big Data Meetup, ApacheCon Edition (Budapest) - Tuesday, November 18
http://www.meetup.com/Big-Data-Meetup-Budapest/events/208253412/

AUSTRALIA

Drilling in on SQL and Hadoop (Melbourne) - Wednesday, November 19
http://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/218590301/

SPAIN

Databricks Comes to Barcelona (Barcelona) - Thursday, November 20
http://www.meetup.com/Spark-Barcelona/events/212164712/

INDIA

Big Data Meetup (Bangalore) - Friday, November 21
http://www.meetup.com/Bangalore-Hadoop-Meetups/events/216724732/

Hadoop Workshop (Hyderabad) - Saturday, November 22
http://www.meetup.com/hyderabad-scalability/events/217755662/