Of lakes and sparks – How Hadoop 2 got it right

Misconceptions

Article from Issue 172/2015
Author(s):

Hadoop version 2 has transitioned from an application to a Big Data platform. Reports of its demise are premature at best.

In a recent story on the PCWorld website titled "Hadoop successor sparks a data analysis evolution," the author predicts that Apache Spark will supplant Hadoop in 2015 for Big Data processing [1]. The article is so full of mis- (or dis-)information that it really is a disservice to the industry. To provide an accurate picture of Spark and Hadoop, several topics need to be explored in detail.

First, like any article on "Big Data," is it important to define exactly what you are talking about. The term "Big Data" is a marketing buzz-phase that has as much meaning as things like "Tall Mountain" or "Fast Car." Second, the concept of the data lake (less of a buzz-phrase and more descriptive than Big Data) needs to be defined. Third, Hadoop version 2 is more than a MapReduce engine. Indeed, if there is anything to take away from this article it is the message in Figure 1. And, finally, how Apache Spark neatly fits into the Hadoop ecosystem will be explained.

Figure 1: Hadoop version 2 is much more than MapReduce.

What Is Big Data?

Big Data, as the name implies, suggests large-volume data processing – often measured in petabytes (PB). According to Wikipedia, several characteristics define Big Data [2]:

  • Volume – Large volumes clearly define Big Data. In some cases the sheer size of the data makes it impossible to evaluate with conventional means.
  • Variety – Data may come from a variety of sources and not necessarily be "related" to other data sources.
  • Velocity – This term in the Big Data context refers to how fast the data can be generated and processed.
  • Variability – Data may be highly variable, incomplete, or inconsistent.
  • Complexity – Relationships between data sources may not be entirely clear and not amenable to traditional relational methods.

Organizations may have several of the data processing needs mentioned above and still not require the processing of large volumes of data. The notion that all companies are siting on petabytes of un-analyzed data is not necessarily true. Consider a blog post [3] in which the author mentions research that indicates Big Data's sweet spot starts at 110GB and that the most common amount of data the average company has under management is between 10 and 30TB.

Additionally, the paper "Nobody Ever Got Fired for Using Hadoop on a Cluster" [4] points out that at least two analytics production clusters at Microsoft and Yahoo! have median job input sizes of less than 14GB and that 90 percent of jobs on a Facebook cluster have input sizes of less than 100GB.

A better description of Big Data processing is high-performance data processing (HPDP), because the characteristics mentioned above need some form of high-performance computing to achieve their goal. This designation is similar to high-performance technical computing (HPTC), which is often referred to as HPC or Supercomputing. Furthermore, an argument can be made that HPDP and HPTC are really the same things or at least have significant overlap. For now, I'll limit the discussion to HPDP.

Discovering the Data Lake

Before I look at how one might process Big Data, I'll take a look at where the data is stored. One of the features not mentioned above, but certainly implied, is the concept of a central storage depot for all data. Although not all data may be amenable to a relational database, it will need to be stored in raw form. This characteristic is what often distinguished HPDP from more traditional methods. Often referred to as a "data lake," the idea is to create a vast repository for all data and use it as needed.

Contrast this approach with that of a traditional relational database or data warehouse. Adding data to the database requires that it be transformed into a predetermined schema before it can be loaded into the database. This step is often called Extract, Transform, and Load (ETL) and can consume both time and money before the data can be used. Most importantly, decisions about how the data will be used must be made during the ETL step. Moreover, data that does not fit into the data schema or is deemed un-needed is often discarded in the ETL step.

Hadoop focuses on using data in its raw format. Essentially, what looks like the ETL step is performed when the data is accessed by Hadoop applications. This approach, "schema on read," allows programmers and users to enforce a structure to suit their needs when they access data. The traditional data warehouse approach, "schema on write," requires more upfront design and assumptions about how the data will eventually be used.

With respect to Big Data, as described above, the data lake concept offers three advantages over a more traditional approach:

  1. All data is available, with no need to make any assumptions about future data use.
  2. All data is sharable. Multiple business units or researchers can use all available data, some of which was not available because of data compartmentalization on disparate systems.
  3. All access methods are available. Any processing engine can be used to examine the data (e.g., MapReduce, graph processing, in-memory tools, etc.).

To be clear, Hadoop is not destined to replace data warehouses. Data warehouses are valuable business tools; however, the traditional data warehouse technology was developed before the data lake began to fill. The growth of new data sources from disparate sources, including social media, click trails, sensor data, and others, are starting to fill the data lake. Of course, there will be overlap, and each tool will address the need for which it was designed.

Hadoop is often compared with a data warehouse in terms of an either-or choice. One of the primary criticisms of Hadoop is the slow batch nature of MapReduce jobs. Concerns about Hadoop version 1 speed are valid in some cases; however, as I explain later in this article, Hadoop version 2 provides a much more flexible and expansive platform for HPDP.

The difference between a traditional data warehouse and Hadoop is depicted in Figure 2. In the figure, different data (purple) can be seen entering either an ETL process or a data lake. The ETL process places the data in a schema as it stores (writes) the data to the relational database, whereas the data lake just stores the raw data. When a Hadoop application uses the data, the schema is applied when it reads the data from the lake. Note that the ETL step often discards some data as part of the process.

Figure 2: The traditional data warehouse compared with the Hadoop data lake.

Apache Hadoop 2 Lake House

Another point to notice in Figure 2 is the "Hadoop" box that operates on the data needed by the user. In Hadoop version 1, this step was limited to a parallel MapReduce engine. Popular packages, such as Pig or Hive SQL, were built on top of this engine. In version 2, however, Hadoop has become a cluster operating system that provides a platform to build data lake applications. Hadoop version 2 includes a MapReduce framework that is compatible with version 1 MapReduce and most version 1 applications will run without change (including Pig and Hive).

Hadoop version 1 has two main components:

  • Hadoop Distributed File Systems (HDFS) – the data lake for the cluster.
  • Monolithic MapReduce engine – manages MapReduce jobs and workflow.

Hadoop version 2 has expanded on this design and added much more flexibility to a Hadoop installation. Hadoop now has three main components:

  • HDFS – the same data lake as version 1.
  • Hadoop YARN (Yet Another Resource Negotiator) – an application-agnostic workload scheduler that manages user jobs.
  • Applications – any application that runs under YARN and takes advantages of "YARN services" to access cluster resources and use the data lake. These applications can include any type of processing – MapReduce, graph processing, message-passing interface (MPI) applications, and even, as will be explained below, in-memory applications like Apache Spark.

One design feature of both Hadoop versions 1 and 2 is the use of a simple redundancy model. Most Hadoop clusters are constructed from commodity hardware (x86 servers, Ethernet, and hard drives). Hardware is assumed to fail and thus processing and storage redundancy are part of the top-level Hadoop design.

Because the MapReduce process is "functional" in nature, data can only move in one direction. For instance, input files cannot be altered as part of the MapReduce process. This restriction allows for a very simple computation redundancy model where dead processes on failed nodes can be restarted on other servers with no loss of results (execution time may be extended, however).

Version 2 maintained this level of "non-stop" redundancy in the YARN scheduler. An application can request and release resources at run time. Thus, if a process fails, the application can request additional resources to complete the task. Additionally, because resource use is dynamic, applications can release un-needed (completed) resources so the overall cluster utilization can be increased. This situation is typical for large MapReduce jobs where the mapping phase often uses more processes than the reducer phase.

The redundancy also extends to HDFS. Unlike many HPTC clusters, in which parallel I/O is a sub-system and separate from the compute nodes, a Hadoop cluster uses commodity hardware for both processing and storage nodes. These nodes often serve as both storage and processing elements so "computation can be moved to the data." Thus, the data lake often spans the entire Hadoop cluster, and jobs can be thought of as fleets of fishing boats traversing the lake in search of results.

In a similar fashion, YARN can be considered the lake house where you need to request boats to fish on the lake. HDFS also can be easily configured with enough redundancy so that losing a node or rack of nodes will not result in job failure or data loss. This "non-stop design" has been a hallmark of Hadoop clusters and is very different from most HPTC systems in which the occasional hardware failure and subsequent job termination is acceptable.

Before I introduce Apache Spark, another project bears mentioning. The Apache Tez [5] project is a behind-the-scenes optimization of Hadoop MapReduce jobs. Most higher level tools, such as Hive SQL or Pig, create combinations of MapReduce jobs that are then executed on the cluster. Each MapReduce job operates separately and thus is reading and writing to and from HDFS (i.e., reading and writing to disk).

Tez combines these individual jobs and does away with the disk usage when possible and transfers data directly in-memory to the next phase of the MapReduce work flow. As part of the Stinger Project, Tez has helped improve the speed performance of Hive SQL jobs on Hadoop by factor of 50, and in some cases by a factor of 160 [6].

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus