Data Eng Weekly


Hadoop Weekly Issue #109

22 February 2015

There was quite a bit of news this week with the announcement of the Open Data Platform, Pivotal open-sourcing several systems, and announcements related to Strata+Hadoop World. I've highlighted a few major announcements (there were too many to cover them all in depth), and I've also found a number of interesting technical articles covering Spark, Kafka, Cascalog, and more.

Technical

This post provides one of the best descriptions of a Data Lake that I've seen. It also talks about several common problems with, misconceptions of, and best practices for productionizing a data lake.

http://martinfowler.com/bliki/DataLake.html

The O'Reilly Radar blog has a post describing several compute frameworks for Hadoop--everything from SQL to machine learning to real-time. The post describes the key considerations for choosing a framework and gives some guidance as to when to use each.

http://radar.oreilly.com/2015/02/processing-frameworks-for-hadoop.html

Apache Spark is adding a new DataFrames API, which is inspired by data frames in R and Pandas (Python). DataFrames are like a table in an RDBMS, but contain additional optimizations. In particular, materialization of DataFrames uses the Spark SQL optimizer and code generation framework. There are more details on the API, which is planned for Spark 1.3, in the introductory post.

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

The ingest.tips blog has a walkthrough of a new feature in Kite 0.18.0, which allows importing of data using custom InputFormats.

http://ingest.tips/2015/02/17/kite-0-18-0-adds-custom-inputformat-support/

Answers is a near real-time mobile app analytics system built by Crashlytics/Twitter. The Twitter blog has a post describing the architecture of the system, which ingests billions of events per day. The system implements the Lambda architecture, using Kafka as the messaging layer, Storm for the speed layer, and EMR with Cascading for batch computation.

https://blog.twitter.com/2015/handling-five-billion-sessions-a-day-in-real-time
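
For readers new to the Lambda architecture, here is a minimal Python sketch of the idea: a batch view recomputed from the full event log, a speed view covering events that arrived since the last batch run, and a query that merges the two. It is only an illustration, not the Answers implementation; the event shape and the names `build_batch_view`, `SpeedLayer`, and `query` are hypothetical.

```python
from collections import Counter

def build_batch_view(event_log):
    """Batch layer: recompute per-app event counts from the full log."""
    return Counter(event["app"] for event in event_log)

class SpeedLayer:
    """Speed layer: incrementally counts events not yet covered by a batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["app"]] += 1

    def reset(self):
        # Called once a new batch view has absorbed these events.
        self.realtime_view.clear()

def query(app, batch_view, speed_layer):
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view[app] + speed_layer.realtime_view[app]

# Hypothetical data: two events already in the batch log, one newer event.
log = [{"app": "a"}, {"app": "b"}]
batch = build_batch_view(log)
speed = SpeedLayer()
speed.on_event({"app": "a"})
print(query("a", batch, speed))  # 2
```

In a real deployment the batch view would be rebuilt periodically (for example by a Cascading job) and the speed view reset afterwards, so errors in the incremental path are eventually corrected by the batch recomputation.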

In last week's newsletter, there was mention of separating Spark from Hadoop. This week, Pinterest has written about just that--they're using Spark Streaming with MemSQL for real-time analytics. The prototype system uses Spark Streaming to take data from a Kafka topic, join it with dimensional data, and send the data to MemSQL.

http://engineering.pinterest.com/post/111380432054/real-time-analytics-at-pinterest
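
The "join with dimensional data" step can be pictured with a small plain-Python sketch. This is not Pinterest's code and does not use Spark or MemSQL; the field names (`item_id`, `action`, `category`) and the `enrich` helper are hypothetical and only show the shape of a stream-to-dimension lookup join.

```python
def enrich(events, dimensions):
    """Merge each streamed event with its dimension row, keyed by item_id.

    Events without a matching dimension row pass through unchanged.
    """
    for event in events:
        dim = dimensions.get(event["item_id"], {})
        yield {**event, **dim}

# Hypothetical dimension table and event stream.
dimensions = {1: {"category": "food"}, 2: {"category": "travel"}}
events = [
    {"item_id": 1, "action": "repin"},
    {"item_id": 3, "action": "click"},
]

rows = list(enrich(events, dimensions))
for row in rows:
    print(row)
```

In a streaming system the enriched rows would be written to the downstream store in each micro-batch rather than collected into a list.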

The MSDN blog has a post about tuning performance of Sqoop jobs on Azure HDInsight. The suggestions are mostly distribution-independent (e.g. tuning number of map tasks, sizing the cluster and db properly), so it's a useful read if you're working with Sqoop.

http://blogs.msdn.com/b/bigdatasupport/archive/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop.aspx
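
To see why the number of map tasks matters, recall that Sqoop parallelizes an import by dividing the range of a split column among the mappers. The Python sketch below shows an even range split under the assumption of an integer key; it is a simplified illustration rather than Sqoop's actual splitting code, and `split_ranges` is a hypothetical helper.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous
    sub-ranges, one per mapper, with sizes differing by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than rows: remaining mappers get nothing
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(split_ranges(1, 10, 4))  # [(1, 3), (4, 6), (7, 8), (9, 10)]
```

Each range becomes one query against the database, so more mappers means more concurrent connections, and a skewed split column leaves some mappers with far more rows than others.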

The MongoDB blog has a tutorial on integrating MongoDB and Hive. The post describes how to use the MongoStorageHandler for Hive to query a Mongo-backed table.

http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

This post describes how the components of the MapReduce API fit together and the role of each. Topics covered include InputFormats, RecordReaders, and OutputCommitters.

https://www.mapr.com/blog/how-use-mapreduce-api
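
As a rough mental model of those roles, here is a toy word count in plain Python. It is not the Hadoop API: the function names, the in-memory "splits", and the temporary file name are all hypothetical, and the sketch only mirrors the division of labor (splits, record reading, map, shuffle, reduce, commit).

```python
import os
import tempfile
from collections import defaultdict

def record_reader(split):
    """RecordReader role: turn one input split into (key, value) records."""
    for offset, line in enumerate(split):
        yield offset, line

def mapper(_key, line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

def run_job(splits, output_dir):
    # InputFormat role: here the splits are simply handed in as lists of lines.
    grouped = defaultdict(list)
    for split in splits:
        for key, value in record_reader(split):
            for out_key, out_value in mapper(key, value):
                grouped[out_key].append(out_value)  # shuffle: group by key

    # OutputCommitter role: write to a temporary file, then move it into
    # place so readers never see partial output.
    tmp_path = os.path.join(output_dir, "_temporary_part-00000")
    final_path = os.path.join(output_dir, "part-00000")
    with open(tmp_path, "w") as out:
        for word in sorted(grouped):
            for key, total in reducer(word, grouped[word]):
                out.write(f"{key}\t{total}\n")
    os.replace(tmp_path, final_path)
    return final_path

with tempfile.TemporaryDirectory() as out_dir:
    path = run_job([["a b", "a"], ["b c"]], out_dir)
    with open(path) as f:
        result = f.read()
print(result)
```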

Netflix recently announced the Surus project, which is an open-source library of analysis tools for Pig and Hive. This week, they added the second function to the library: Robust Anomaly Detection (RAD). The Netflix blog has an overview of the goals of the tool, the algorithm it implements, and how it can be used via Apache Pig.

http://techblog.netflix.com/2015/02/rad-outlier-detection-on-big-data.html

This presentation describes best practices for building a data architecture. It contains ideas like using Kafka as a data bus, directory layouts for datasets in HDFS, using Spark Streaming, and schema management. Lots of tips for building a reliable and consistent system.

http://www.slideshare.net/gwenshap/data-architectures-for-robust-decision-making

Cascalog, the Clojure library for Cascading, has recently added support for custom Hadoop counters (on master). This post describes how to update counters as part of a Cascalog job and how to access the counters programmatically afterwards.

http://www.samritchie.io/cascalog-hadoop-counters/

News

The Strata+Hadoop World conference was this week in San Jose. Videos of the keynotes and select interviews have been published on YouTube. Included in the list is a keynote by President Obama and the U.S. Chief Data Scientist, Dr. DJ Patil.

https://www.youtube.com/playlist?list=PL055Epbe6d5aWZSOZAZ4MX5xXKEvlT6y_

TechTarget has an overview of the benefits of a Hadoop-powered data lake. The article looks at Allstate and Solutionary Inc, who have both recently created data lakes. Example benefits include the ability to look at country-level data (at Allstate) for the first time and using large-scale machine learning to identify when home inspections aren't necessary for a homeowners insurance policy.

http://searchdatamanagement.techtarget.com/feature/Dip-in-Hadoop-data-lake-can-be-bracing-for-big-data-users

Hortonworks, Pivotal, IBM, GE, Verizon, and others announced the "Open Data Platform" (ODP) this week. The goal is to standardize Hadoop ecosystem components and versions to ease interoperability across distributions. Companies such as Cloudera, which didn't join the ODP, have responded negatively to the announcement. There have been a number of articles about this topic, but I find the Gartner blog has one of the best takes on both sides of the argument.

http://blogs.gartner.com/nick-heudecker/who-asked-for-odp/

Related to the ODP announcement, Pivotal and Hortonworks announced that they'll be "aligning efforts around Hadoop." As part of this, customers can choose to use either Pivotal HD or the Hortonworks Data Platform, and Hortonworks will provide advanced support for enterprise customers of both distributions.

http://hortonworks.com/blog/pivotal-hortonworks-announce-alliance/

Pivotal made another announcement this week which is easy to overlook given all the discussion around the Open Data Platform. The company is open-sourcing its Greenplum, HAWQ, and GemFire database products (and still offering licenses and support). Greenplum is the company's analytics data warehouse, HAWQ is the SQL engine for Hadoop, and GemFire is an in-memory distributed database.

https://gigaom.com/2015/02/17/pivotal-open-sources-its-hadoop-and-greenplum-tech-and-then-some/

Cloudera released information on company revenue and growth. They achieved ~100% year-over-year growth and over $100 million in revenue across 525 customers.

https://gigaom.com/2015/02/17/cloudera-claims-more-than-100m-in-revenue-in-2014/

Datanami reports that Hadoop's lack of enterprise security features, including fine-grained access control, is limiting and sometimes preventing enterprise adoption. The post mentions some companies that are selling products to add security features.

http://www.datanami.com/2015/02/19/will-poor-data-security-handicap-hadoop/

Databricks and Intel announced a partnership to optimize Spark for Intel architecture. Intel's work on core Hadoop helped bring encryption-at-rest and other important features to the platform, so it should be interesting to see what comes of this partnership.

http://blogs.intel.com/evangelists/2015/02/20/unlocking-promise-data-driven-world/

This post provides a recap of several themes that emerged at this week's Strata+Hadoop World. These include continued infatuation with Spark, security for Kafka, and a discussion around Spark Streaming vs. Storm for stream processing.

http://ingest.tips/2015/02/21/hot-strata-2015/

Releases

Apache Cassandra 2.1.3 was released this week. The release contains over 100 fixes and improvements.

http://www.mail-archive.com/user@cassandra.apache.org/msg41000.html

IBM announced several new modules for their BigInsights distribution. These include BigInsights Analyst (for integrating spreadsheets and visualizations with their SQL-on-Hadoop engine), BigInsights Data Scientist (for machine-learning on large datasets), and BigInsights Statistical Management (for managing resources and optimizing workflows).

http://www.datanami.com/2015/02/19/ibm-embraces-hadoop-in-biginsight-push/

Cloudera announced that Apache Kafka has graduated from Cloudera Labs and is now fully supported as part of Cloudera Enterprise. A technical post on the Cloudera blog describes how to deploy Kafka using CDH and includes some guidance for choosing hardware and sizing a cluster. It also describes various details of the architecture, such as replication, partitioning, and how to guarantee message delivery.

http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2015/02/18/real-time-messaging-system-apache-kafka.html
http://blog.cloudera.com/blog/2015/02/how-to-deploy-and-configure-apache-kafka-in-cloudera-enterprise/

Microsoft announced availability of HDP 2.2, which includes Apache Storm, as part of their Azure HDInsight Hadoop-as-a-Service platform. They also announced a preview of HDInsight on Linux, which uses Apache Ambari for deployment.

http://hortonworks.com/blog/microsoft-azure-hdinsight-on-linux-expands-application-platforms/
http://blogs.microsoft.com/blog/2015/02/18/new-azure-services-help-people-realize-possibilities-big-data/

Hadoop-as-a-Service company Altiscale announced two new features this week. First, Apache Spark has been fully integrated into their platform. Second, they're now offering secure-mode for Hadoop using Kerberos.

https://www.altiscale.com/hadoop-blog/spark-fully-supported/
https://www.altiscale.com/hadoop-blog/kerberos-authentication/

Qubole has also added support for Apache Spark to their Qubole Data Services platform.

http://www.qubole.com/blog/product/qubole-apache-spark/

Tableau announced support for Spark SQL as part of the 8.3.3 release of Tableau. The connector is certified by Databricks.

http://money.cnn.com/news/newsfeeds/articles/prnewswire/SF32196.htm

MapR announced version 4.1 of their distribution. Key features include bi-directional data replication between MapR-DB clusters in separate data centers, a POSIX client for loading data into MapR FS, and a new C API for MapR-DB.

http://www.datanami.com/2015/02/18/mapr-delivers-bi-directional-replication-with-distro-refresh/

Cloudera has released version 1.1 of Cloudera Director, their tool for provisioning CDH clusters in AWS. This release includes support for dynamically-resizing a cluster and an integration with Amazon's RDS (database-as-a-service). The Cloudera blog has more details and enumerates features planned for the future.

http://blog.cloudera.com/blog/2015/02/whats-new-in-cloudera-director-1-1/

Apache Gora is an in-memory data model and persistence framework for Apache HBase, Apache Cassandra, and several other data stores (both k/v and RDBMS). This week, version 0.6 was released. The release updates the versions of several of the dependencies (HBase, Avro, Hadoop, and more) that it supports.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201502.mbox/%3CCAGaRif1J3KB1mkj-9jspicqeO6v-28Ezx2QJgO2Gs4_52odevg@mail.gmail.com%3E

Druid, the time-series database open-sourced by Metamarkets, recently switched from the GPL to the Apache license.

https://gigaom.com/2015/02/20/the-druid-real-time-database-moves-to-an-apache-license/

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Going from Hadoop to Spark: A Case Study (San Jose) - Monday, February 23
http://www.meetup.com/SF-Bay-ACM/events/220032641/

PredictionIO DASE Architecture with Spark MLlib (San Francisco) - Tuesday, February 24
http://www.meetup.com/SF-Bayarea-Machine-Learning/events/220418482/

The Lambda Architecture (Sunnyvale) - Wednesday, February 25
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/219395552/

Hadoop Multi-Tenancy Panel Discussion (Sunnyvale) - Wednesday, February 25
http://www.meetup.com/Pepperdata/events/220016192/

Hadoop RDBMS (San Ramon) - Wednesday, February 25
http://www.meetup.com/Analyzing-and-processing-BIG-Data/events/215743192/

Apache Drill: A Schema-free SQL Query Engine for Hadoop and NoSQL (Oakland) - Wednesday, February 25
http://www.meetup.com/eastbayjug/events/220434708/

What the Spark!? Intro and Use Cases (Mountain View) - Thursday, February 26
http://www.meetup.com/Scale-Warriors-of-Silicon-Valley/events/220196962/

Introduction to Hadoop Security, with Roman Shaposhnik (San Francisco) - Thursday, February 26
http://www.meetup.com/Pivotal-Open-Source-Hub/events/220168352/

Oregon

Intro to Apache Spark (Portland) - Wednesday, February 25
http://www.meetup.com/Portland-Data-Science-Group/events/220403370/

Michigan

Apache Storm Tech & Usecase (Troy) - Monday, February 23
http://www.meetup.com/Michigan-Hadoop-User-Group/events/220323944/

Hadoop Usergroup Kickoff Meeting (Lansing) - Tuesday, February 24
http://www.meetup.com/Lansing-Hadoop-Users-Group-Meetup/events/220284311/

North Carolina

Modern Data Integration: Paradigm Shift (Charlotte) - Wednesday, February 25
http://www.meetup.com/CharlotteHUG/events/219135237/

Virginia

Rapid Prototyping in PySparkStreaming (Arlington) - Tuesday, February 24
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/220337055/

Spark (Richmond) - Tuesday, February 24
http://www.meetup.com/Richmond-Java-Users-Group/events/219058368/

Let's Talk Hadoop Operations (Dulles) - Wednesday, February 25
http://www.meetup.com/Code-Brew/events/219909798/

Big Data Security Analytics with Apache Spark and GraphX (Vienna) - Thursday, February 26
http://www.meetup.com/bigdatadc/events/219875609/

Apache Spark & Real-Time Analytics (McLean) - Thursday, February 26
http://www.meetup.com/Hadoop-DC/events/220593864/

Maryland

Apache Spark and Amazon Workshop (Hanover) - Tuesday, February 24
http://www.meetup.com/Apache-Spark-Maryland/events/220273364/

New York

An In-Memory RDBMS as an Alternative to Storm (New York) - Wednesday, February 25
http://www.meetup.com/New-York-City-Storm-User-Group/events/220229069/

Massachusetts

3 Spark Talks (Cambridge) - Monday, February 23
http://www.meetup.com/Boston-Apache-Spark-User-Group/events/219772808/

Spark 0 to Prod in 30 days; Leverage Hadoop 2.0 and YARN with Native Tools (Boston) - Tuesday, February 24
http://www.meetup.com/bostonhadoop/events/219958105/

CANADA

February Meetup: Open Presentation Sessions (Toronto) - Monday, February 23
http://www.meetup.com/TorontoHUG/events/220296791/

IRELAND

Hadoop Introduction, Use Cases, Case Studies & Distributions (Dublin) - Monday, February 23
http://www.meetup.com/hadoop-user-group-ireland/events/220127065/

ENGLAND

Self-Service Data Exploration with Apache Drill (Manchester) - Wednesday, February 25
http://www.meetup.com/HadoopManchester/events/219803836/

SPAIN

Spark Coding Dojo: Scala (Barcelona) - Thursday, February 26
http://www.meetup.com/Spark-Barcelona/events/220256138/

INDIA

Apache Kafka + Zookeeper = 2 Million Writes per Second (Hyderabad) - Saturday, February 28
http://www.meetup.com/hyderabad-scalability/events/220582368/

Session on MapReduce with Python and Amazon EMR (Pune) - Saturday, February 28
http://www.meetup.com/Pune-Big-Data-Analytics-Meetup/events/219751224/

AUSTRALIA

High Performance Analytics on Top of Hadoop (Sydney) - Tuesday, February 24
http://www.meetup.com/Big-Data-Analytics/events/220598962/