Data Eng Weekly


Hadoop Weekly Issue #78

13 July 2014

This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.

Technical

The Pivotal blog has a post on setting up Pivotal HD, HAWQ (for data warehousing), and GemFire XD (for in-memory data grid) inside VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the blog has more info on the configuration and the tools installed as part of the setup.

http://blog.gopivotal.com/pivotal/products/1-command-15-minute-install-hadoop-in-memory-data-grid-sql-analytic-data-warehouse

Datanami has a post about how Concur, which provides expense reporting software, is implementing Hadoop. They’re running a 40-node CDH cluster and currently using it for classification of expense report items and personalized recommendations. The post is full of anecdotes about their Hadoop rollout that will be useful for anyone in a similar situation.

http://www.datanami.com/2014/07/07/hadoop-remaking-travel-expense-reporting-concur/

The Cloudera Kite SDK provides tools and APIs for working with the components of the Hadoop ecosystem. One of these tools is Morphlines, which aims to streamline ETL. This two-part article talks about how to use Morphlines to validate records from a text file and save them into a Hive table. It goes through the Morphlines configuration file options and describes the steps of the process.

http://techidiocy.com/cloudera-kite-morphlines-getting-started-example/
http://techidiocy.com/anatomy-configuration-file-cloudera-kite-morphlines/

The Qubole blog has an article on best practices when working with Apache Hive. It covers how to organize your data on the file system (partitioning and bucketing), choosing serialization formats, configuration parameters to get the most out of Hive (parallel execution and vectorization), and more.

http://www.qubole.com/hive-best-practices/
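
As a rough illustration of the partitioning and bucketing idea mentioned above (not code from the Qubole article), the sketch below shows how a partitioned table maps rows to `key=value` directories and how a bucketed integer column maps rows to a fixed number of buckets. The function names, the example path, and the column names are hypothetical, and the bucket formula is an assumption modeled on the common "non-negative hash modulo bucket count" scheme for integer keys.

```python
def partition_path(table_root, partitions):
    """Build a Hive-style partition directory, e.g. root/dt=2014-07-13/country=US.

    `partitions` is an ordered list of (column, value) pairs.
    """
    parts = ["{}={}".format(col, val) for col, val in partitions]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_for(int_key, num_buckets):
    """Pick a bucket for an integer key.

    Assumption: the hash of an integer key is the key itself, masked to a
    non-negative 31-bit value before taking the modulo.
    """
    return (int_key & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    # Hypothetical table root and partition columns.
    print(partition_path("/warehouse/events", [("dt", "2014-07-13"), ("country", "US")]))
    print(bucket_for(10, 4))
```

Queries that filter on a partition column only need to read the matching directories, and bucketing keeps rows with the same key in the same file, which is what makes both techniques useful for pruning and joins.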

This post covers PigPen, a MapReduce library for Clojure open-sourced by Netflix. It walks through some background on Hadoop, Apache Pig (which serves as the execution engine for PigPen), and Clojure. It also gives a brief introduction to Cascading and related projects (such as Pattern, Lingual, and Driven), and how these compare to the Pig-based stack that Netflix uses. Finally, it goes through some examples of PigPen jobs.

http://bugra.github.io/work/notes/2014-07-09/pigpen-hadoop-pig-clojure-cascading/

In the third part of their series on Apache Oozie, Altiscale has a number of tips for working with the workflow engine. The six tips mostly cover aspects of submitting and running jobs with Oozie.

https://www.altiscale.com/apache-oozie-tips-tricks/

Hortonworks has curated a list of presentations covering Hadoop operations from the recent Hadoop Summit. Slides and videos for each presentation are available via the Summit archive.

http://hortonworks.com/blog/apache-hadoop-operations-scale/

The Cloudera blog has a post on analyzing time-series data with Apache Crunch. The article covers generating Avro-serialized time-series data from SequenceFiles (including the event time series Avro schema), doing some simple analysis with the Crunch API (e.g. finding min, max, and counts), and doing a cross-join for multivariate analysis. The code for the post is available on GitHub.

http://blog.cloudera.com/blog/2014/07/how-to-build-advanced-time-series-pipelines-in-apache-crunch/
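
For readers unfamiliar with the kind of analysis described above, here is a minimal plain-Python sketch of per-series min, max, and count, plus a cross join of two series. It does not use the Crunch API and is not the article's code; the event layout `(series, timestamp, value)` and the function names are assumptions made for illustration.

```python
from itertools import product


def summarize(events):
    """Return {series: (min, max, count)} for (series, timestamp, value) events."""
    stats = {}
    for series, _timestamp, value in events:
        if series not in stats:
            stats[series] = (value, value, 1)
        else:
            lo, hi, count = stats[series]
            stats[series] = (min(lo, value), max(hi, value), count + 1)
    return stats


def cross_join(left, right):
    """Pair every element of `left` with every element of `right`."""
    return list(product(left, right))


if __name__ == "__main__":
    # Hypothetical sample events.
    events = [("cpu", 1, 0.5), ("cpu", 2, 0.9), ("mem", 1, 3.0)]
    print(summarize(events))
    print(len(cross_join([1, 2], [3, 4, 5])))
```

A cross join grows as the product of the two input sizes, so in a real pipeline it is usually applied to already-aggregated or windowed series rather than raw events.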

The Databricks Cloud was announced at the Spark Summit last week. This post highlights some of the interesting features of the product, including dashboarding and real-time processing. As highlighted in the post, the Databricks Cloud makes it very easy to build products from data.

http://gradientflow.com/2014/07/12/databricks-cloud-makes-it-easier-to-build-data-products/

News

Recordings of presentations from HBaseCon were posted. There are talks from four tracks—operations, features & internals, ecosystem, and case studies.

http://hbasecon.com/archive.html

The Gartner blog has a post analyzing the rise of Apache Spark, which a number of vendors are jumping to support. It talks about how Spark tends to be easy to integrate (if a Hadoop integration was already done), and also how companies don’t want to be slow to adopt Spark (as many were for Hadoop).

http://blogs.gartner.com/nick-heudecker/spark-restarts-the-data-processing-race/

This week, Cloudera announced a partnership with Capgemini, and Hortonworks announced a partnership with Accenture. In both agreements, Capgemini and Accenture will help customers deploy their partner's Hadoop distribution. A post on SiliconAngle talks about how these types of partnerships show that Hadoop is maturing as an enterprise product.

http://siliconangle.com/blog/2014/07/11/tsunami-of-team-ups-reaffirms-accelerating-hadoop-maturity/

Actian, makers of the Actian Analytics Platform for SQL on Hadoop, announced a number of partnerships including one with Hortonworks.

http://www.marketwatch.com/story/industry-leaders-rally-behind-actians-sql-in-hadoop-platform-to-industrialize-hadoop-2014-07-08

Releases

InformationWeek has an article on the recently announced DataStax Enterprise 4.5 release. In addition to Spark support, the release has improved support for joining data between a Cassandra cluster and a Hadoop cluster (DataStax says they don’t aim to solve data warehousing and are happy to leave that to Hadoop).

http://www.informationweek.com/big-data/big-data-analytics/datastax-cassandra-release-packs-more-than-spark/d/d-id/1279086

Jumbune is a profiler and debugger for Hadoop MapReduce. It offers per-job, per-job-flow, and cluster-wide analysis tools. It was recently open-sourced under the LGPLv3 license by Impetus Technologies.

http://www.marketwired.com/press-release/impetus-open-source-solution-jumbune-to-accelerate-hadoop-based-solution-development-1926600.htm

Scoobi, the Scala library for building MapReduce jobs, released version 0.8.5 this week. The maintenance release includes a number of improvements and some bug fixes.

http://notes.implicit.ly/post/91095690499/scoobi-0-8-5

Spring for Apache Hadoop 2.0.1 was released. It bumps versions of several dependencies, including Apache Hadoop to 2.4.1.

http://spring.io/blog/2014/07/08/spring-for-apache-hadoop-2-0-1-released

Version 1.0.0 of Cloudera Oryx, a system for real-time machine learning and predictive analytics, was released. The release contains several new endpoints and bug fixes.

http://community.cloudera.com/t5/Data-Science-and-Machine/Oryx-1-0-0-released/m-p/14822

Cloudera Enterprise 5.0.3 was released. There are a number of fixes to the CDH stack, including Flume, HBase, HDFS, Hue, Oozie, YARN, and Solr.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-0-3-CDH-5-0-3-and-Cloudera/m-p/14950#U14950

ProtectFile for Hadoop is new enterprise encryption software from SafeNet. ProtectFile offers encryption at rest for HDFS and includes automation tools for deployment.

http://data-protection.safenet-inc.com/2014/07/big-data-encryption-addresses-hadoop-security-concerns/

Pentaho 5.1, which was released in June, added support for Hadoop YARN. It also includes MongoDB integration and a Data Science Pack that integrates with R and Weka. This post from InformationWeek has many more details on the new release.

http://www.informationweek.com/big-data/big-data-analytics/pentaho-preps-data-on-hadoop-analyzes-on-mongodb/d/d-id/1279187

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Cloudera & Lucidworks: SolrCloud Failover, Testing, and Integration with Hadoop (Palo Alto) - Tuesday, July 15
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/191046852/

46th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) - Wednesday, July 16
http://www.meetup.com/hadoop/events/129795442/

Hadoop Ask Me Anything (Palo Alto) - Wednesday, July 16
http://www.meetup.com/Hadoop-Ask-Me-Anything/events/194173032/

OC Big Data Monthly Meetup #3 (Irvine) - Wednesday, July 16
http://www.meetup.com/OCBigData/events/179381122/

July SF Hadoop Users Meetup (San Francisco) - Wednesday, July 16
http://www.meetup.com/hadoopsf/events/189897052/

Hey Big Data, Meet Apache Spark, by Marco Vasquez of MapR (Santa Monica) - Wednesday, July 16
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/175709772/

Colorado

In-Memory Computing Principles (Denver) - Monday, July 14
http://www.meetup.com/Data-Science-Business-Analytics/events/189837112/

Texas

Extending Apache Ambari (Houston) - Thursday, July 17
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/188066532/

Hadoop and Big R (Irving) - Saturday, July 19
http://www.meetup.com/Dallas-R-Users-Group/events/192928382/

Nebraska

Shawn Hermans Presents Big Data (Omaha) - Thursday, July 17
http://www.meetup.com/Heartland-Big-Data-Meetup/events/191993412/

Missouri

Apache Cassandra (Saint Louis) - Tuesday, July 15
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/189775412/

Illinois

Deep Learning: Theory, Practice and Predictions with H2O (Chicago) - Wednesday, July 16
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

Georgia

Beyond MapReduce: In-Memory Analysis with Spark and Shark (Atlanta) - Tuesday, July 15
http://www.meetup.com/atlcassandra/events/188461182/

North Carolina

Triad Hadoop Users Group (Winston Salem) - Thursday, July 17
http://www.meetup.com/Triad-Hadoop-Users-Group/events/187375842/

New York

Introduction to Apache Mesos (New York) - Monday, July 14
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/184053172/

A Leap Forward for SQL on Hadoop (New York) - Monday, July 14
http://www.meetup.com/Big-Data-Developers-in-NYC/events/189542182/

Massachusetts

Boston Spark User Group July Presentation Night (Cambridge) - Tuesday, July 15
http://www.meetup.com/Boston-Apache-Spark-User-Group/events/184426442/

SINGAPORE

Technical Workshop - Revolution Analytics and Cloudera (Singapore) - Monday, July 14
http://www.meetup.com/R-User-Group-SG/events/193625622/

GERMANY

Couchdoop and Other Consumer Use Cases from the Hadoop Ecosystem (Munich) - Thursday, July 17
http://www.meetup.com/Hadoop-User-Group-Munich/events/188851932/

POLAND

Hadoop 2.0 Processing Framework (Krakow) - Friday, July 18
http://www.meetup.com/datakrk/events/193755742/

INDIA

Hadoop Map-Reduce with Cascading (Hyderabad) - Saturday, July 19
http://www.meetup.com/Hyderabad-Programming-Geeks-Group/events/189970072/

Big Data Meetup (Bangalore) - Saturday, July 19
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/194094032/

Hadoop Meetup (Bangalore) - Saturday, July 19
http://www.meetup.com/Bangalore-Baby-Hadoop-group/events/189310322/