A Spark in the Cloud
Complete large processing tasks by harnessing Amazon Web Services EC2, Apache Spark, and the Apache Zeppelin data exploration tool.
Last month I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize the jobs easily and cheaply in the cloud and how to make sense of the data they produce.
Both of these tasks are somewhat interrelated, because if you're going to run your software in the cloud, it's helpful to have a good front end to control it, and this front end should provide a good way of analyzing the data.
Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied down to any one cloud provider and can always run your code on whatever hardware you like, I recommend Apache Zeppelin as a front end. Zeppelin is open source, and it's supported by Amazon's Elastic MapReduce (EMR), which means it's quick and easy to get started.
Although Zeppelin will work just as well running on your own infrastructure, I'm running it on Amazon because that's the easiest and cheapest way to get access to a large amount of computing power.
To begin, you'll need to set up an account with Amazon Web Services (AWS) [2]. Following this tutorial will cost you a little money, but it needn't be much (as you'll see in a bit). Working out exactly how much something will cost in AWS can be a little complex, and it's made even more difficult because you can get some services – up to a certain amount – for free. I'll try and keep everything simple (and cheap).
You have to pay for the machines you use. AWS has a mind-boggling number of different machine types [3], but rather than going too far into the details, I've found that the m4.xlarge machine works well and, as you scale up, it can be useful to move to the m4.4xlarge or m4.10xlarge. The cost is a little difficult to predict. The basic on-demand price is $0.20 per hour; however, you don't actually need to pay this much, because AWS has a feature called "spot instances" that lets you bid on unused capacity. With spot instances, you set a maximum price you're willing to pay, and if that's more than anyone else is willing to pay, you get the machines.
A further complication is that instances can be in different data centers around the world. Because spot bids are per data center, you can often find cheaper machines by shopping around the different regions. You can see the current minimum price you need to pay for a spot instance in a particular region by going to the AWS website [2] and, in the box menu in the top left corner, selecting EC2. Under Instances in the left-hand menu, you'll see Spot Requests, and on the new page, you'll see Pricing History. At the time of writing, m4.xlarge machines are $0.064 per hour in northern Virginia but $0.023 per hour in London. If you start doing large amounts of data processing on AWS, you'll need to pick a region in which to store your data, and the availability of spot instances can be a key factor in this decision. Obviously, the saving here of $0.177 per hour over the on-demand price isn't huge, but if you're using lots of machines that are each significantly more powerful (and expensive), the saving on spot instances can make a huge difference.
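If you'd rather check spot prices from a script than click through the console, the boto3 library can query the same pricing history. The following is a minimal sketch, assuming boto3 is installed and your AWS credentials are already configured; the regions and instance type are just the examples used above.

# Minimal sketch: compare recent m4.xlarge spot prices in two regions.
# Assumes boto3 is installed and AWS credentials are configured.
from datetime import datetime, timedelta
import boto3

for region in ["us-east-1", "eu-west-2"]:  # northern Virginia and London
    ec2 = boto3.client("ec2", region_name=region)
    history = ec2.describe_spot_price_history(
        InstanceTypes=["m4.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=1),
    )
    for record in history["SpotPriceHistory"][:5]:
        print(region, record["AvailabilityZone"], record["SpotPrice"])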
As a word of caution, spot instances are charged per hour, but if someone outbids you, they get the machine instantly, and you get cut off. The Hadoop platform that Spark runs on is quite resilient to this process, as long as you don't lose all the computers in the cluster. I'll look at how to stop this a bit later, but for now, I'll just say that it's best not to bid too close to the current spot price; otherwise, you're liable to lose your machine very quickly.
When you start up an EC2 machine, you get a bare Linux environment, on which you'll need to install a bunch of software. You could write a script that sets up everything for you, but it's far easier to let Amazon organize the work for you with EMR, which will set up everything you need on your machines. Using EMR costs an additional $0.06 per machine per hour for m4.xlarge machines, so at the London spot price of $0.023, a simple two-machine cluster with EMR comes to $0.083 per machine, or $0.166 per hour in total.
EMR is under Analytics in the AWS box menu. On the EMR page, click on Create Cluster. You'll need to switch to Advanced Options (because you can't select Zeppelin under the quick options), then make sure that you have both Zeppelin and Spark checked, as well as the default options.
Under the Hardware tab, you can select the machines you want. EMR offers three different types of machines: Master, Core, and Task. The general advice for running stable clusters is to have a single Master machine, enough Core machines to run the job in the worst case, and then as many Task machines as you want. With the Master and Core machines on an uninterruptible tariff (e.g., on-demand) and the Task machines as spot instances, your job won't be killed halfway through if someone outbids you, but it will finish sooner (and more cheaply) if spot instances are available. However, for simple tests, I usually use spot instances for all my machines because I'm a cheapskate. In the web browser, you can delete the entire row for Task machines, then set the number and spot prices for the Master and Core machines.
On the next screen, turn off Logging and Termination Protection (both of these are more useful when you have a pre-defined job ready to run). Give your cluster a useful name and hit Next, then Create Cluster to set up your machines. It takes a few minutes for the machines to be set up and have all the software installed.
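If you find yourself creating clusters regularly, the same setup can be scripted with boto3's EMR client rather than clicked through the console. Here's a rough sketch; the release label, bid prices, and key pair name are placeholders you'd replace with your own values.

# Sketch: create a comparable Spark + Zeppelin cluster from Python with boto3.
# The release label, bid prices, and key pair name are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-2")
response = emr.run_job_flow(
    Name="spark-zeppelin-test",
    ReleaseLabel="emr-5.8.0",            # pick a current EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": "m4.xlarge",
                "InstanceCount": 1,
                "Market": "SPOT",
                "BidPrice": "0.10",      # your maximum spot bid in USD
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m4.xlarge",
                "InstanceCount": 1,
                "Market": "SPOT",
                "BidPrice": "0.10",
            },
        ],
        "Ec2KeyName": "my-key-pair",     # an existing EC2 key pair
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])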
For security reasons, the cluster will be set up with everything locked down by a firewall, and you need to add a rule that allows you in. On the EMR Cluster screen (Figure 1), you should see Security groups for Master followed by a link. Clicking that link takes you to a new screen. Check the master security group and select Actions | Edit inbound rules. Create a rule for your IP address (you can find this by visiting the What Is My IP Address site [4]) followed by /32, covering ports 0-65535. Back on the EMR screen, you can now click on the Zeppelin link to access the web UI (Figure 2).
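Incidentally, if you end up adding that inbound rule repeatedly, it can also be scripted. This is just a hedged sketch with boto3, where the security group ID and IP address are placeholders for your own values.

# Sketch: open the EMR master security group to a single IP address.
# The group ID and IP address below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the ElasticMapReduce-master security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],  # your public IP with /32
    }],
)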
From the Zeppelin web UI, you can run your Spark code on your cluster from the web browser. Zeppelin code is organized into notebooks, each of which contains "paragraphs" of code (the language is set at the start of the paragraph with %pyspark for Python or %sql for SparkSQL). The results of SQL queries are automatically transformed into charts. You can see an example of how to get started at Notebook | Zeppelin Tutorial.
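To give a flavor of how the paragraphs fit together, here's a small, hypothetical pair: the first uses %pyspark to build a toy DataFrame and register it as a temporary view, and the second uses %sql to query it, which Zeppelin can render as a chart. On recent Spark versions the session object is called spark; on older setups you may need sqlContext and registerTempTable instead.

%pyspark
# First paragraph: build a tiny DataFrame and expose it to SparkSQL.
from pyspark.sql import Row
prices = [Row(region="us-east-1", price=0.064),
          Row(region="eu-west-2", price=0.023)]
df = spark.createDataFrame(prices)
df.createOrReplaceTempView("spot_prices")

%sql
-- Second paragraph: query the view; Zeppelin offers table and chart views.
SELECT region, price FROM spot_prices ORDER BY price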
Don't forget to terminate your EMR instances when you're finished, or you'll continue to be charged.
Infos
- "Tutorials – Apache Spark" by Ben Evarard, Linux Pro Magazine, isue 202, September 2017, pg. 89, http://www.linuxpromagazine.com/Issues/2017/202/Tutorials-Apache-Spark
- AWS: http://aws.amazon.com
- AWS machines: http://www.ec2instances.info
- What Is My IP Address: http://whatismyip.com