Apache Spark

FAQ

Article from Issue 201/2017

Spread your processing load across hundreds of machines as easily as running it locally.

Q: Apache Spark? I've wanted to set fire to my Apache web server more than a few times – usually when I'm elbows-deep in a config file that just refuses to work as I want it to. Is that what it's for?

A: I've been there, too, but no. The web server commonly known as Apache is officially called the Apache HTTP Server. The Apache Software Foundation manages hundreds of projects, and only a few of them are related to web servers. Apache Spark is a programming platform (Figure 1).

Figure 1: The project website (spark.apache.org) has all the information you need to get started.

Q: But, there are hundreds of programming platforms already. What does Apache Spark have that others don't?

A: To understand Spark, you have to understand the problem that it's designed to solve: Processing Big Data.

Q: Ah, Big Data! The buzzword du jour of the computing industry. Let's start with the obvious question: Just how big does Big Data have to be to need Spark? Gigabytes? Terabytes? Petabytes … er … jibblywhatsitbytes?

A: With Big Data, it's better not to think of the size in terms of absolute numbers. For our purposes, we'll say that data becomes big once there's too much to process on a single machine at the speed you need it. All Big Data technologies are based around the idea that you need to coordinate your processing across a group of machines to get the throughput you need.

Q: What's so hard about that? You just shuffle the data around to different machines, run the same command on them, and then you're done. In fact, give me a minute; I think I can put together a single command to SCP data to other machines and then use SSH to run the commands on them.

A: You don't even need to write a complex command if you just want to load balance data processing across various machines; the GNU Parallel tool has support for that baked in. Although this approach can work really well for some simple jobs (e.g., recompressing a large number of images), it can very quickly become complex or even impossible, for example, when processing one piece of data depends on the results of some other part of the processing.

Q: It sounds like you're just making up problems now. What real-world issues does this help with?

A: Probably the best example of a problem that Spark solves is machine learning.

Q: As in artificial intelligence?

A: Yep. Machine learning can work in many ways, but consider this one: You push data through an algorithm that tries to intuit something about the data while, at the same time, learning from the data it sees. Learning and processing happen simultaneously. For this to work, there has to be a link between the separate computers doing the processing (so that they all learn in the same way), yet the processing still needs to be distributed across them as much as possible.

Q: This sounds a bit like black magic. How does Spark manage to keep a consistent, yet changing model across multiple machines?

A: The concept that sits at the heart of the Spark platform is Resilient Distributed Datasets (RDDs). An RDD is an immutable collection of data. That means that once it's created, an RDD doesn't change. Instead, transformations that happen to an RDD create a new RDD. The most basic use of RDDs is something you may be familiar with: MapReduce.
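To make that less abstract, here is a minimal PySpark sketch (assuming a local Spark installation with PySpark available; the values and app name are invented for illustration). Each transformation produces a new RDD, and nothing is computed until an action such as collect() asks for a result:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from an in-memory list; once created, it never changes.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations return new RDDs instead of modifying the original.
doubled = numbers.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Only an action such as collect() triggers the actual computation.
print(evens.collect())  # [4, 8]

sc.stop()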

Q: Ah, yes. I've heard of that. I can't remember what it is though!

A: Very simply, MapReduce is another paradigm for processing Big Data. It goes through every item in your dataset and runs a function on it (this is the map), and then it combines all these results into a single output (this is the reduce). For example, if you had a lot of images and you wanted to balance their brightness and then create a montage of them, the map stage would be going through each image in turn, and the reduce stage would be bringing each balanced image together into a montage.

Spark is heavily inspired by MapReduce and perhaps can be thought of as a way to expand the MapReduce concept to include more features. In the above example, there would be three RDDs. The first would be your source images, the second would be created by a transform that balances the brightness of your images, and the third would be created by the transform that brings them all together to make the montage.
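As a toy sketch of the pattern (with numbers standing in for the images, because a real brightness-balancing pipeline won't fit in a short FAQ), the map and reduce stages in PySpark might look like this:

from pyspark import SparkContext

sc = SparkContext("local[*]", "mapreduce-demo")

# The source items (here just numbers) form the first RDD.
data = sc.parallelize([1, 2, 3, 4])

# Map: run a function on every item independently, producing a new RDD.
processed = data.map(lambda x: x * x)

# Reduce: combine all the per-item results into a single output.
result = processed.reduce(lambda a, b: a + b)

print(result)  # 30

sc.stop()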

Q: Ah, OK. How does this work with artificial intelligence again?

A: Well, there are two answers to that. The first is complex and doesn't fit in a two-page FAQ; the second is: "don't worry about that, someone's done it for you."

Q: Ah, there's a library to use?

A: To call Spark's MLlib a machine learning library seems to undersell it, but because that's what its name actually stands for, we can't really argue with it. Essentially, it is a framework that comes complete with all the common machine learning algorithms ready to go. Although it does require you to do some programming, 10 lines of simple code is enough to train and run a machine learning model on a vast array of data split over many machines (Figure 2). It really does make massively parallel machine learning possible for almost anyone with a little programming experience and a few machines.

Figure 2: Distributed machine learning in 10 lines of Python (plus a few comments). It doesn't get much easier than this.
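The exact listing from Figure 2 isn't reproduced here, but a sketch in the same spirit, training MLlib's logistic regression on a tiny hand-made dataset (the data and app name are invented), shows how little code is involved:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny hand-made training set; in practice this would be loaded from HDFS.
training = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.5, 1.3)),
    (1.0, Vectors.dense(2.2, 0.8)),
], ["label", "features"])

# Train a logistic regression model; Spark distributes the work for you.
model = LogisticRegression(maxIter=10).fit(training)

# Run the trained model and show its predictions next to the true labels.
model.transform(training).select("label", "prediction").show()

spark.stop()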

Q: OK, so that's machine learning. Are there any other tasks that Spark really excels at?

A: There's nothing else that Spark makes quite as easy as machine learning, but one other area that is gaining popularity is stream data processing. This is where you have a constant flow of incoming data that you want to process as soon as it arrives (and typically send on to some time-series database). The Spark Structured Streaming framework makes it easy to perform just this sort of processing.
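A minimal Structured Streaming sketch (assuming the data arrives on a local socket, say from nc -lk 9999, and writing the running word counts to the console rather than to a real time-series database) looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a stream of text lines from a local socket.
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete") \
    .format("console").start()
query.awaitTermination()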

It's wrong to think of Spark as just for machine learning and streaming, though. These are just two areas that happen to have frameworks on Spark. The tool itself is general-purpose and can be useful for almost any distributed computing.

Q: So, all I need to create my own streaming machine learning system is a couple of machines, a Linux distro, and Spark?

A: Not quite. You also need some way of sharing the data among the machines. The Hadoop Distributed File System (HDFS) is the most popular way of doing this. It provides a way of storing more data than any one machine can hold and of processing it efficiently without moving it between computers more than necessary. HDFS also provides resilience in case one or more machines break. After all, the more machines you've got, the higher the chances are that one of them breaks, and you don't want your cluster to go down just because one machine has some issues.
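Once HDFS is in place, reading from it in Spark is just a matter of pointing at an hdfs:// URL (the namenode address and file path below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Hypothetical HDFS location; namenode host, port, and file are placeholders.
logs = spark.read.text("hdfs://namenode:9000/data/logs.txt")

# Spark ships the processing to the nodes holding the data blocks,
# so the file itself is not copied around the cluster unnecessarily.
print(logs.count())

spark.stop()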

Q: Right. I'm off to build a business empire on computers that are more intelligent than I am.

A: Good luck.

