Data Eng Weekly


Hadoop Weekly Issue #161

13 March 2016

This week's issue is short and sweet (although there sure are a lot of events!). In terms of long reads, there are very interesting posts on Kafka Streams and Flafka at Vodafone. On the release front, MapR, Apache Flink, and Apache Phoenix all had big releases. Congrats to the Flink team on achieving version 1.0!

Technical

The Azure Data Lake blog has a post demonstrating Scala implicits with Spark. Using the example of adding a saveToAzureSql method to a DataFrame, the post shows how to write an implicit conversion method along with the necessary JDBC code.

https://blogs.msdn.microsoft.com/azuredatalake/2016/03/01/extending-spark-with-extension-methods-in-scala-fun-with-implicits/
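
For readers new to the pattern, here is a minimal sketch of an extension method written as an implicit class. It is not the code from the post: Table is a hypothetical stand-in for a Spark DataFrame so that the example runs without Spark on the classpath, and the method body builds SQL strings rather than opening a JDBC connection. Only the method name saveToAzureSql comes from the post; every other name is an assumption made for illustration.

```scala
// Hypothetical stand-in for org.apache.spark.sql.DataFrame, so the sketch has no dependencies.
final case class Table(columns: Seq[String], rows: Seq[Seq[Any]])

object AzureSqlSyntax {
  // An implicit class is shorthand for a wrapper class plus an implicit conversion to it.
  implicit class TableOps(private val table: Table) {
    // Assumed behavior: return INSERT statements instead of executing them over JDBC.
    // Real code should use a java.sql.PreparedStatement rather than string building.
    def saveToAzureSql(target: String): Seq[String] = {
      val cols = table.columns.mkString(", ")
      table.rows.map { row =>
        val values = row.map {
          case s: String => "'" + s.replace("'", "''") + "'" // escape single quotes
          case other     => other.toString
        }.mkString(", ")
        s"INSERT INTO $target ($cols) VALUES ($values)"
      }
    }
  }
}

object ImplicitsDemo {
  import AzureSqlSyntax._ // brings the implicit class into scope

  def main(args: Array[String]): Unit = {
    val events = Table(Seq("id", "name"), Seq(Seq(1, "flink"), Seq(2, "phoenix")))
    // The compiler rewrites this call to new TableOps(events).saveToAzureSql("events").
    events.saveToAzureSql("events").foreach(println)
  }
}
```

The new method is only visible where AzureSqlSyntax._ is imported, which keeps the extension opt-in for callers.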

Spark's GraphX is a graph processing library that extends Spark RDDs. This introductory post gives some basic examples of the API and dives into some more advanced features (such as PageRank and Pregel-like calculations). The post is full of example code, which should be sufficient for getting going as a new user.

https://www.mapr.com/blog/how-get-started-using-apache-spark-graphx-scala
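
To give a feel for what a Pregel-like PageRank calculation does, here is a rough sketch in plain Scala collections. It deliberately does not use the GraphX API, the vertex ids are made up, and it uses an unnormalized form of PageRank with an assumed damping factor of 0.85. Each iteration sends messages along edges, merges them per vertex, and then updates every vertex.

```scala
object PageRankSketch {
  // Edges are (source, destination) pairs; all names here are illustrative.
  def pageRank(edges: Seq[(Int, Int)], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks: Map[Int, Double] = vertices.map(_ -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Send step: each edge carries rank / outDegree from its source to its destination.
      val messages = edges.map { case (s, d) => d -> ranks(s) / outDegree(s) }
      // Merge step: sum the contributions arriving at each vertex.
      val incoming = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      // Vertex step: combine the random-jump term with the summed contributions.
      ranks = vertices.map(v => v -> ((1 - damping) + damping * incoming.getOrElse(v, 0.0))).toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Vertices 1 and 2 both link to 3, and 3 links back to 1.
    val ranks = pageRank(Seq((1, 3), (2, 3), (3, 1)), iterations = 20)
    ranks.toSeq.sortBy(_._1).foreach { case (v, r) => println(s"vertex $v: $r") }
  }
}
```

In GraphX the same three steps run as distributed operations over vertex and edge RDDs, which is what the library adds over this single-machine version.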

This presentation describes a cable company's migration from an Oracle Exadata-based data warehouse to a Hadoop-based system for handling petabytes of data. During the transition, they tried out Phoenix, Impala, and Titan. Drawing on their experience rolling out and productionizing Titan atop HBase, the presentation shares several lessons learned.

http://www.slideshare.net/roadan/not-your-dads-h-base-new

The Confluent blog has a post about Kafka Streams, a feature of the upcoming Kafka 0.10 (and also in a preview release of the Confluent Platform). Kafka Streams is a lightweight, "hipster" processing framework built to fill a gap identified by the LinkedIn team that built Apache Samza. It provides a lot of out-of-the-box support (such as joins and stateful processing) with a simple API and without requiring a separate distributed computing framework like YARN. The post dives pretty deep into why Kafka Streams is important and what types of use cases it's built to solve.

http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple

The Altiscale blog has a post with tips for running scheduled Hadoop jobs from cron, and it motivates using Apache Oozie as a better alternative. Given that Oozie is built specifically for the Hadoop ecosystem, it supports features like Kerberos and also has data dependency support via coordinator actions.

https://www.altiscale.com/blog/scheduling-jobs-using-cron-or-oozie/

The Cloudera blog has a post about how Vodafone UK uses Flume with Kafka for event transport in their data infrastructure. The post describes their multi-datacenter architecture and several types of performance tuning that they performed. Using a three-node Kafka cluster and two Flume agents, they're able to process over 1 million events/sec (end-to-end).

http://blog.cloudera.com/blog/2016/03/building-benchmarking-and-tuning-syslog-ingest-architecture-at-vodafone-uk/

News

Dell and BlueData, makers of the EPIC software for provisioning Docker-based Hadoop clusters, announced a partnership this week.

http://www.bluedata.com/blog/2016/03/dell-and-bluedata-better-together/

Releases

MapR 5.1 shipped this week. It includes Hadoop, Spark, MapR Streams (general availability), and more. MapR touts first-class support for JSON across real-time event streaming, MapR-DB, and other parts of the system. Other features include security enhancements (access control expressions and selective auditing), SSD optimizations, and improved Docker support. The MapR blog has many more details, and CIO has more coverage of the improved container support.

https://www.mapr.com/blog/mapr-converged-data-platform-release-real-time-reliable-results
http://www.cio.com/article/3041508/open-source-tools/mapr-delivers-support-for-containers-security-in-latest-hadoop-release.html

Apache Flink 1.0.0 was released this week. Key highlights include public API compatibility for 1.x releases, support for complex event processing, improved support for high-memory operations, and improved monitoring.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces88
http://flink.apache.org/news/2016/03/08/release-1.0.0.html

The Hortonworks blog has details of the features of Apache Ambari 2.2, which is part of HDP 2.4. The most notable features are automated upgrades, simplified security options, and additional troubleshooting information.

http://hortonworks.com/blog/announcing-apache-ambari-2-2/

Apache Apex Malhar 3.3.1-incubating was released this week. Malhar is the development library of prebuilt connectors, operators, etc. for Apex. In this release, the team has fixed a number of bugs.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201603.mbox/%3CCAHvM9d2fA_punsn=xFPm1zyLM9LZTmtZAxwtGmVEigAAQME-aA@mail.gmail.com%3E

Apache Kudu (incubating) released version 0.7.1. This fixed a handful of high-priority bugs.

http://getkudu.io/releases/0.7.1/docs/release_notes.html#rn_0.7.1

Apache Phoenix, the SQL-on-HBase system, announced version 4.7 this week. The new release includes beta support for ACID transactions, enhanced consistency guarantees for secondary indexes, improved performance, and over 150 bug fixes.

https://blogs.apache.org/phoenix/entry/announcing_phoenix_4_7_released

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

#OCBigData Meetup (Irvine) - Wednesday, March 16
http://www.meetup.com/OCBigData/events/228653301/

Big Data Application Meetup (Palo Alto) - Wednesday, March 16
http://www.meetup.com/BigDataApps/events/228039399/

Malhar & Geode Integration; Ingest: Kafka to Hadoop with Apex & Results Into Geode (San Jose) - Thursday, March 17
http://www.meetup.com/Apex-Bay-Area-Chapter/events/228593080/

Washington

Big Data and Retail: Building Shopping Lists and Data Processing Engines (Bellevue) - Wednesday, March 16
http://www.meetup.com/Big-Data-Bellevue-BDB/events/222646100/

Ohio

Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, March 14
http://www.meetup.com/Cleveland-Hadoop/events/228062615/

North Carolina

Spark vs. Hadoop for Big Data (Durham) - Tuesday, March 15
http://www.meetup.com/Research-Triangle-Analysts/events/228852457/

Virginia

Apache Spark Proof of Technology by IBM (McLean) - Tuesday, March 15
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/229094350/

Analyzing Event Streams Using Spark and GraphX w/ Myles Baker (Richmond) - Tuesday, March 15
http://www.meetup.com/804RVA/events/228929180/

Real-Time Aggregations, Approximations, Similarities, and Recommendations (McLean) - Tuesday, March 15
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/229298675/

Pennsylvania

LVTech TechTalk: Big Data, Hadoop, and All That (Bethlehem) - Tuesday, March 15
http://www.meetup.com/lvtech/events/227455461/

New Jersey

How to Build a Recommendation Engine Using Spark 1.6 and HDP (Princeton) - Thursday, March 17
http://www.meetup.com/nj-datascience/events/229292885/

New York

Introduction to Hadoop (Syracuse) - Wednesday, March 16
http://www.meetup.com/Central-New-York-Software-Developers-Meetup/events/228811089/

Integrating Apache Flink and Apache NiFi (New York) - Wednesday, March 16
http://www.meetup.com/futureofdata-newyork/events/229412331/

Massachusetts

Hybrid Solution Analysis of Streaming Sensor Data with Spark Streaming & Kafka (Boston) - Tuesday, March 15
http://www.meetup.com/Big-Data-Developers-in-Boston/events/228978344/

St. Patty's Day Meet-Up on an Introduction to Apache Kudu (Boston) - Thursday, March 17
http://www.meetup.com/bostonhadoop/events/229317099/

BRAZIL

Apache Flink Real-World Use Cases with Slim Baltagi (Sao Paulo) - Thursday, March 17
http://www.meetup.com/Brazil-Sao-Paulo-Apache-Flink-Meetup/events/229257747/

SPAIN

Configuring the Layered Cake of Hadoop + Scaling Remote Engineering Teams (Sevilla) - Thursday, March 17
http://www.meetup.com/Bitnami-Sevilla/events/229482829/

FRANCE

Data Munging with Spark, Part I (Toulouse) - Tuesday, March 15
http://www.meetup.com/Tlse-Data-Science/events/229338356/

NETHERLANDS

Office Hours with Holden Karau (Amsterdam) - Monday, March 14
http://www.meetup.com/Amsterdam-Spark/events/228667345/

DENMARK

"Extreme" Apache Spark (Copenhagen) - Tuesday, March 15
http://www.meetup.com/Big-Data-Denmark/events/229071135/

SWITZERLAND

Real-Life Apache Spark: Tips and Tricks from the Trenches (Zurich) - Monday, March 14
http://www.meetup.com/spark-zurich/events/229251189/

ITALY

Drilling into Data with Apache Drill + Stream-based Microservice Architecture (Milano) - Thursday, March 17
http://www.meetup.com/HUG-Italy/events/228721026/

INDIA

Understanding and Building Big Data Architectures (Hyderabad) - Saturday, March 19
http://www.meetup.com/hyderabad-scalability/events/228780848/