Data Eng Weekly


Hadoop Weekly Issue #147

29 November 2015

With the holiday in the US, it was a relatively light news week, but there are good technical articles covering stream processing, Spark, and more. Also, there were a number of releases—Apache Drill, Apache Flink, and Apache Kafka.

Technical

This post describes Apache Flink's DataStream API by way of an example program that processes Tweets from the Twitter API. It covers setting up a local development environment, how to write a custom StreamGenerator (of Tweets), and how to run the program via the Flink command-line utils.

http://blog.brakmic.com/stream-processing-with-apache-flink/

This presentation gives a practical overview of two popular stream processing frameworks: Apache Storm and Apache Spark Streaming. It offers advice on both, including the pros and cons of each, and some rules for when to use one or the other.

http://www.slideshare.net/MammothData/all-things-open-2015-spark-storm-when-where

This post describes four ways to integrate R with Hadoop. There's also an example of using the RHadoop library for interacting with data in HDFS and running a MapReduce job.

http://www.edureka.co/blog/4-ways-to-use-R-and-Hadoop-together

The Morning Paper covered "Asynchronous Complex Analytics in a Distributed Dataflow Architecture," which looks at mechanisms to increase the performance of machine learning calculations in distributed systems like Hadoop and Spark. The authors have built a prototype atop Spark using Asynchronous Sideways Information Passing (ASIP), which has different characteristics from the Bulk Synchronous Parallel model typically used. The paper describes some of the challenges of the implementation and reports on its performance.

http://blog.acolyer.org/2015/11/26/asip/

The MapR blog has a brief introduction to PySpark, the Python bindings for Apache Spark.

https://www.mapr.com/blog/using-python-apache-spark

The upcoming Apache Spark 1.6 has support for directly querying the contents of a file without first creating a table. This doc has some examples of using the feature.

https://docs.cloud.databricks.com/docs/spark/1.6/examples/query.files.sql.html
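
As a rough illustration of the feature, the query names the data source and a backtick-quoted path in place of a table name. The sketch below only builds such a statement as a string; the helper name direct_file_query and the path /tmp/users.parquet are made up for this example, and the linked doc is the place to check the exact syntax for your Spark version.

```python
import re


def direct_file_query(fmt, path, columns="*"):
    """Build a Spark SQL statement that reads a file in place.

    Assumes the format.`path` form, for example:
    SELECT * FROM parquet.`/tmp/users.parquet`
    """
    # The data source name is a plain identifier such as parquet or json.
    if not re.match(r"^[A-Za-z_][A-Za-z0-9_.]*$", fmt):
        raise ValueError("unexpected data source name: %r" % fmt)
    # The path is wrapped in backticks, so it must not contain one itself.
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return "SELECT {0} FROM {1}.`{2}`".format(columns, fmt, path)


# Hypothetical path, used only for illustration.
query = direct_file_query("parquet", "/tmp/users.parquet")
print(query)
```

With a running Spark 1.6 SQLContext, the resulting string would be passed to sqlContext.sql(query); that step is left out here so the snippet runs without a Spark installation.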

News

The SystemML project, which is a large-scale machine learning framework with support for Hadoop and Spark execution models, has been accepted into the Apache Incubator. SystemML was open-sourced by IBM earlier this year.

http://www.ibm.com/blogs/think/2015/11/24/introducing-a-universal-translator-for-big-data-and-machine-learning/

Apache: Big Data North America is May 9-12, 2016 in Vancouver, Canada. The Call for Proposals is open now through February 12th.

http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp

Releases

Version 0.3.4 of Schedoscope, the scheduling framework for Hadoop data warehouses, was recently released. The new version adds support for Hive 1.1.0, is based on Scala 2.11, and includes major performance improvements.

https://github.com/ottogroup/schedoscope/releases/tag/release-0.3.4

Apache Kafka 0.9.0 was released this week. The Confluent blog has a summary of the major work in the release (over 500 Jira issues were resolved), which includes security, Kafka Connect (for copying data in and out of Kafka), a new consumer API, and user-defined quotas (on a per-client basis). The new version also drops support for Java 6 and Scala 2.9.

http://www.confluent.io/blog/apache-kafka-0.9-is-released

The 1.3 version of Apache Drill was released this week with several new features. Highlights include enhanced S3 support, heterogeneous type support, header parsing for text files, and support for sequence files.

https://drill.apache.org/blog/2015/11/23/drill-1.3-released/

On the heels of the recent 0.10.0 release, Apache Flink announced the 0.10.1 bugfix release. It's a recommended upgrade for all users, and it resolves over 20 issues.

http://flink.apache.org/news/2015/11/27/release-0.10.1.html

Cloudera released the second beta of Kudu, the new storage engine for Hadoop. Version 0.6.0 contains changes to the Java client, new commands in the kudu-admin tool, support for single-node development on OS X, and more.

http://getkudu.io/releases/0.6.0/docs/release_notes.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Streaming Data Analytics: Next-Generation Big Data Techniques (San Francisco) - Tuesday, December 1
http://www.meetup.com/sfdata/events/226579688/

Big Data Application Meetup (Palo Alto) - Wednesday, December 2
http://www.meetup.com/Apex-Bay-Area-Chapter/events/226817171/

Apache Eagle: Secure Your Hadoop Data (San Jose) - Thursday, December 3
http://www.meetup.com/Big-Data-Security-and-Data-Governance-Meetup/events/226914667/

Baidu and Spark (Sunnyvale) - Thursday, December 3
http://www.meetup.com/spark-users/events/226686232/

Oregon

SnappyData: Real Time Operational Analytics with Apache Spark! (Portland) - Tuesday, December 1
http://www.meetup.com/Hadoop-Portland/events/226330909/

Arizona

Uniting Spark and Hadoop: The One Platform Initiative (Scottsdale) - Wednesday, December 2
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/226337169/

Texas

American Airlines, Datameer and Cloudera (Fort Worth) - Thursday, December 3
http://www.meetup.com/DFW-BigData/events/226612303/

Illinois

Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Chicago) - Tuesday, December 1
http://www.meetup.com/Big-Data-Developers-in-Chicago/events/226883184/

District of Columbia

IBM Lights the Spark in DC (Washington) - Thursday, December 3
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/226612671/

Massachusetts

Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Boston) - Thursday, December 3
http://www.meetup.com/Open-Source-Analytics-Boston/events/226759308/

CANADA

Spark Technology Discussion & Demo (Kitchener) - Tuesday, December 1
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226999515/

UNITED KINGDOM

Big Trouble: Getting into the Flow of Hadoop Testing (London) - Monday, November 30
http://www.meetup.com/Women-Who-Code-London/events/225366423/

SWEDEN

Google Cloud Dataproc & the Network Behind the Elephant (Stockholm) - Wednesday, December 2
http://www.meetup.com/stockholm-hug/events/226986550/

FINLAND

Lauri Niskanen: A Recommendation System Illustrated with Spark (Tampere) - Tuesday, December 1
http://www.meetup.com/Tampere-Data-Science/events/226052588/

POLAND

MUG #1 - Mesos Fundamentals (Warsaw) - Friday, December 4
http://www.meetup.com/Warsaw-Mesos-User-Group/events/226380947/

CROATIA

Streaming Data with Apache Kafka (Zagreb) - Wednesday, December 2
http://www.meetup.com/Apache-Spark-Zagreb-Meetup/events/226394442/

ISRAEL

Apache Spark in the Cloud, Fighting World Hunger (Tel Aviv-Yafo) - Tuesday, December 1
http://www.meetup.com/israel-spark-users/events/226854594/

CHINA

Shanghai Big Data Streaming 2nd Meetup (Shanghai) - Sunday, December 6
http://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/226970213/

SINGAPORE

Spark Meetup during Strata! (Singapore) - Tuesday, December 1
http://www.meetup.com/Spark-Singapore/events/219039180/

Meetup @ Strata with Doug Cutting, Ted Dunning + more (Singapore) - Wednesday, December 2
http://www.meetup.com/BigData-Hadoop-SG/events/226577138/

Strata Community Event: Productionizing Data Science at Scale (Singapore) - Thursday, December 3
http://www.meetup.com/DataScience-SG-Singapore/events/226456676/

AUSTRALIA

Apache Flink and NiFi (Melbourne) - Tuesday, December 1
http://www.meetup.com/HadoopMelbourne/events/226664315/