A Spark in the Cloud

Article from Issue 203/2017

Complete large processing tasks by harnessing Amazon Web Services' EC2, Apache Spark, and the Apache Zeppelin data exploration tool.

Last month I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize the jobs easily and cheaply in the cloud and how to make sense of the data they produce.
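
To recap the basic idea, here is a minimal PySpark sketch (not from last month's article; the app name and numbers are placeholders) showing how a job splits a computation across whatever nodes the cluster provides:

from pyspark import SparkContext

# "local[*]" uses all local cores; on a real cluster, point master
# at the cluster manager instead.
sc = SparkContext(master="local[*]", appName="ParallelDemo")

# Distribute a range of numbers across the executors and
# compute the sum of squares in parallel.
rdd = sc.parallelize(range(1, 1000001), numSlices=100)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)

print("Sum of squares:", total)
sc.stop()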

These two tasks are interrelated: if you're going to run your software in the cloud, you need a good front end to control it, and that front end should also provide a good way of analyzing the data.

Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied down to any one cloud provider and can always run your code on whatever hardware you like, I recommend Apache Zeppelin as a front end. Zeppelin is open source, and it's supported by Amazon's Elastic MapReduce (EMR), which means it's quick and easy to get started.
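
As a rough sketch of how quick the EMR route is, the following boto3 call launches a small cluster with both Spark and Zeppelin preinstalled. The cluster name, region, instance types, key pair, and S3 log bucket are all placeholder assumptions; substitute your own:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a three-node cluster with Spark and Zeppelin as
# preinstalled applications.
response = emr.run_job_flow(
    Name="spark-zeppelin-demo",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",
        # Keep the cluster running so you can use Zeppelin
        # interactively rather than tearing down after one job.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])

Once the cluster reports a running state, Zeppelin's notebook interface is reachable on the master node, giving you the front end for both driving Spark and exploring its output.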

[...]
