Data Eng Weekly


Hadoop Weekly Issue #68

04 May 2014

There are several articles this week covering deploying Hadoop, including two on integrating Hadoop and Docker. Given how hard it can be to test out Hadoop (let alone deploy to production), it’s always promising to see new tools and systems being used. Videos from Hadoop Summit Amsterdam were posted, and there are several new releases including a Tech Preview of Spark on HDP, and a new version of Impala. Enjoy all of the content to consume and new software to try out!

Technical

The Pivotal blog has a post on running the Pivotal HD distribution inside of Docker. By utilizing pre-packaged Docker images, it's very simple to get an environment up and running. The tutorial includes setting up MapReduce as well as HAWQ, the SQL-on-Hadoop system from Pivotal. There are Docker containers for some other distributions, so it should be possible to adapt this tutorial to other environments.

http://blog.gopivotal.com/pivotal/products/6-easy-steps-deploy-pivotals-hadoop-on-docker

Another post covers getting a Hadoop cluster going quickly, this time using Vagrant and Puppet to provision virtual machines running in VirtualBox. Specifically, this bootstraps 3 VMs with Apache Ambari, at which point you can use the management software to install and configure the Hadoop daemons. If you want to try out Ambari, this is a good way to do so pretty quickly.

https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

Rather than running Hadoop in Docker, this post discusses some upcoming support for running Docker containers inside of YARN. Docker supports pre-baked images that can contain libraries and binaries not found on the host, making it possible to run jobs with vastly different sets of dependencies on the same compute node (akin to virtualization, but with much less overhead). The Register has more details on the integration, including interviews with Altiscale CEO Raymie Stata and Hortonworks’ Arun Murthy.

http://www.theregister.co.uk/2014/05/02/docker_hadoop/

The Sqrrl blog has a post on recent news related to big data security. It covers HDFS ACLs, Apache Knox, MongoDB 2.6, and Cloudera Search. The post wraps up with details about the security features of Sqrrl Enterprise.

http://sqrrl.com/big-data-security-roundup/

The Cloudera blog has a post on the recently announced Python client for Impala, impyla. It contains a walkthrough of the API, including the preview APIs for integrating with scikit-learn and shipping Python UDFs.

http://blog.cloudera.com/blog/2014/04/a-new-python-client-for-impala/

Apache Bigtop is a system for building Hadoop ecosystem components into a cohesive unit, and it is used to package most Hadoop distributions. This post walks through how Bigtop builds RPM packages for each of the components.

http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html

A guest post on the Cloudera blog by WibiData engineer Jonathan Natkins describes how to integrate a custom service into Cloudera Manager. The integration relies on a new feature of Cloudera Manager 5 called custom service descriptors. If you’re using Hadoop ecosystem components not supported by Cloudera with CDH, this offers an opportunity to manage them alongside the Hadoop services.

http://blog.cloudera.com/blog/2014/04/how-to-extend-cloudera-manager-with-custom-service-descriptors/

The DataStax blog has an interesting article explaining how they provision and test Cassandra across multiple data centers and 1000 nodes in the cloud.

http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time

The Hortonworks blog is doing a series on resilience/high-availability for the YARN Resource Manager (RM). The first phase of this work, a mechanism for persisting the state of the RM to a data store (HDFS and ZooKeeper are implemented), is complete. Clients must use a new RMProxy library to survive an RM restart.

http://hortonworks.com/blog/rm-yarn-resilience/
http://hortonworks.com/blog/apache-hadoop-yarn-resilience-hadoop-yarn-applications-across-resourcemanager-restart-phase-1/
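
To make the idea of persisting RM state concrete, here is a minimal toy sketch in Python. It is not YARN code and does not use any Hadoop API: FileStateStore, ToyResourceManager, and demo are hypothetical names, and a local JSON file stands in for a durable store such as HDFS or ZooKeeper. The only point is the pattern described above: write state to the store on every change, and reload it when a new instance starts. Client-side retry (the role of RMProxy) is not modeled.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical stand-in for a durable store (not an HDFS or ZooKeeper client)."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        # Write to a temporary file and rename it, so a crash mid-write
        # never leaves a partially written state file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        # An empty store means there is nothing to recover.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)


class ToyResourceManager:
    """Toy manager that records every submitted application in the store."""

    def __init__(self, store):
        self.store = store
        # On startup, recover whatever a previous instance persisted.
        self.apps = store.load()

    def submit(self, app_id, name):
        self.apps[app_id] = {"name": name, "state": "ACCEPTED"}
        # Persist after every change so a restart loses nothing.
        self.store.save(self.apps)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "rm_state.json")
    rm = ToyResourceManager(FileStateStore(path))
    rm.submit("app_1", "wordcount")
    # Simulate a restart: a brand-new instance pointed at the same store.
    restarted = ToyResourceManager(FileStateStore(path))
    return restarted.apps
```

Calling demo() returns the applications known to the restarted instance, which were recovered from the store rather than from memory.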

MortarData has a post about integrating MongoDB and Hadoop. The post includes links to their documentation that describe several strategies for accessing MongoDB data in Hadoop, and there is a video from their CEO describing how to build a recommendation engine with Hadoop and MongoDB.

http://blog.mortardata.com/post/84327807886/build-a-recommendation-engine-with-mongodb-and-hadoop

News

Videos from Hadoop Summit in Amsterdam in early April have been posted online. The talks cover five tracks, and slides for many of the talks are posted, too.

http://hadoopsummit.org/amsterdam/schedule/

In a post entitled “Spark on fire,” the DBMS2 blog describes recent Spark news and how companies are deploying Spark. The post notes that Spark 1.0 is expected to be released later this month, and discusses SparkSQL and applications of Spark for machine learning.

http://www.dbms2.com/2014/04/30/spark-on-fire/

Another post on the DBMS2 blog covers Cloudera’s SQL-on-Hadoop positioning. Cloudera supports both Hive and Impala, and it’s not always clear which system should be used for which type of processing (at least in the longer term). It’ll also be interesting to see how Shark and SparkSQL fit into Cloudera’s strategy.

http://www.dbms2.com/2014/04/30/cloudera-impala-data-warehousing-and-hive/

Cloudera and MongoDB have expanded their partnership to include co-marketing and co-selling of each other's software. There are also plans to support live-snapshotting of MongoDB data to a Hadoop cluster for analysis.

http://www.informationweek.com/big-data/software-platforms/mongodb-cloudera-form-big-data-partnership/d/d-id/1234919

Pepperdata, makers of Hadoop cluster supervisor and analysis software, have announced a Series A round of financing totaling $5M. They will use the money to grow their team and further product development.

http://www.datacenterknowledge.com/archives/2014/04/29/pepperdata-raises-5-million-grow-hadoop-solution/
http://pepperdata.com/news/pepperdata-raises-5M-in-funding/

In the third of three posts this week, the DBMS2 blog enumerates the details (and adds some speculation) on the recent Intel investment in Cloudera. It includes some of the short and medium-term goals of the relationship and specifics on the financial transaction.

http://www.dbms2.com/2014/04/30/the-intel-investment-in-cloudera/

ComputerWeekly has an article that explores whether Hadoop should complement or replace a data warehouse. It paints a picture of Hortonworks being in the “complement” camp while Cloudera is in the (eventually) “replace” camp. It also includes quotes from Teradata's CTO, who doesn’t think that replacing an EDW with Hadoop makes financial sense.

http://www.computerweekly.com/feature/Cloudera-v-Hortonworks-Hadoop-to-complement-replace-data-warehouse

InformationWeek has a story on Datameer’s software, which takes a different approach than other systems. Instead of relying on a SQL-on-Hadoop system to answer queries to power a BI tool, it offers a spreadsheet and visualization tool that operates directly on data in HDFS or another data store.

http://www.informationweek.com/big-data/software-platforms/datameer-bets-visual-analysis-beats-sql-on-hadoop/d/d-id/1234873

Releases

Hortonworks has announced a Tech Preview of Apache Spark for HDP 2.1. The preview is based on Apache Spark 0.9.1, and Hortonworks has published RPMs and debs for installing the software.

http://hortonworks.com/blog/announcing-hdp-2-1-tech-preview-component-apache-spark/

Cloudera announced the 1.3.1 release of Impala. The new version includes improvements to memory handling and additional SQL functions.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Impala-1-3-1/m-p/11638

Apache Tajo 0.8.0 was released. Tajo is a low-latency, distributed SQL-on-Hadoop system that also supports additional platforms and data stores. The new release includes a number of new SQL features, improved performance and scalability, added support for new storage systems and formats (including Amazon S3 and Parquet), and much more. The Apache blog has full coverage of the new features.

https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0

Apache Kafka 0.8.1.1 was released. This is a bug fix release containing 13 fixes, including a fix for a deadlock.

https://archive.apache.org/dist/kafka/0.8.1.1/RELEASE_NOTES.html

Radoop 2.0 was released this week by the company of the same name. Radoop integrates predictive analytics tools from RapidMiner with Hadoop.

http://www.datanami.com/2014/04/30/radoop:_a_predictive_analytic_alternative_to_r_on_hadoop/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

HBaseConHackathon (San Francisco) - Tuesday, May 6, 2014
http://www.meetup.com/hackathon/events/176659262/

Arizona

An Introduction to Apache HBase, MapR Tables, and Security (Phoenix) - Wednesday, May 7, 2014
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/174568242/

Colorado

Revenue Management and Hadoop, 'Data Hubs' & the Data Center Transformation (Boulder) - Thursday, May 8, 2014
http://www.meetup.com/CU-Leeds-Business-Analytics/events/179351422/

Texas

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, May 7, 2014
http://www.meetup.com/Austin-ACM-SIGKDD/events/171159702/

Ohio

Teradata & The Ohio State University to Present (Dublin) - Tuesday, May 6, 2014
http://www.meetup.com/Data-Analytics-Learning-Community/events/175816102/

District of Columbia

Big Data Week 2014 Meetup (Washington) - Monday, May 5, 2014
http://www.meetup.com/Accumulo-Users-DC/events/178839102/

Maryland

2nd Annual Big Data Breakfast (Columbia) - Tuesday, May 6, 2014
http://www.meetup.com/Data-Science-MD/events/178927102/

New York

Hadoop Developer Day (New York) - Tuesday, May 6, 2014
http://www.meetup.com/Big-Data-Developers-in-NYC/events/173144062/

Bridging the gap, OLTP and Real-Time Analytics in a Big Data World (New York) - Tuesday, May 6, 2014
http://www.meetup.com/mysqlnyc/events/174804632/

Apache Spark - Easier and Faster Big Data + Collaborative Filtering (New York) - Wednesday, May 7, 2014
http://www.meetup.com/Spark-NYC/events/177785522/

Intermediate Workshop II: Writing MapReduce Applications (New York) - Friday, May 9, 2014
http://www.meetup.com/New-York-Big-Data-Workshop/events/176417142/

INDIA

BigDataCloud Mini Conference 2014 (Bangalore) - Tuesday, May 6, 2014
http://www.meetup.com/CloudStack-Bangalore-Group/events/177323412/

Introduction to Hadoop (Mumbai) - Wednesday, May 7, 2014
http://www.meetup.com/Mumbai-Big-Data-Enthusiasts/events/180242822/

Hadoop by example (Hyderabad) - Saturday, May 10, 2014
http://www.meetup.com/hyderabad-scalability/events/175499922/

Bangalore Baby Hadoop Meetup (Bangalore) - Saturday, May 10, 2014
http://www.meetup.com/Bangalore-Baby-Hadoop-group/events/177228582/

AUSTRALIA

Special Event: Future of Data - Doug Cutting, Founder of Hadoop (Sydney) - Tuesday, May 6, 2014
http://www.meetup.com/Big-Data-Analytics/events/177821612/

CANADA

SQL for Hadoop (Ontario) - Wednesday, May 7, 2014
http://www.meetup.com/Big-Data-Developers-in-Toronto/events/178343772/