Of lakes and sparks – How Hadoop 2 got it right
Apache Spark
Apache Spark [7] is an exciting project that extends what Hadoop MapReduce can do. First, Spark is a great in-memory parallel processing tool. It is a fast, general-purpose cluster computing system that provides high-level APIs in Java, Scala, and Python, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Second, Spark offers more than 80 high-level operators that make it easy to build parallel applications. It can be used interactively from the Scala and Python shells. Spark can be run in a standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
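To give a feel for what chaining high-level operators looks like, the following is a minimal sketch in plain Python; it is not Spark code and does not run on a cluster. The helper names `flat_map`, `reduce_by_key`, and `word_count` are hypothetical stand-ins written for this illustration, and the comments note the PySpark RDD operators (`flatMap`, `map`, `reduceByKey`) they are meant to mimic.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def flat_map(func, records):
    """Apply func to each record and flatten the results.

    Stand-in for the RDD operator flatMap (hypothetical local helper).
    """
    return [item for record in records for item in func(record)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func.

    Stand-in for the RDD operator reduceByKey (hypothetical local helper).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def word_count(lines):
    """Count words with a flatMap -> map -> reduceByKey style pipeline."""
    # In PySpark, the same pipeline would be written roughly as:
    #   sc.textFile(path).flatMap(lambda l: l.split())
    #     .map(lambda w: (w.lower(), 1)).reduceByKey(add)
    # where sc is a SparkContext and path is an assumed input location.
    words = flat_map(str.split, lines)
    pairs = [(word.lower(), 1) for word in words]
    return reduce_by_key(add, pairs)


if __name__ == "__main__":
    sample = ["the data lake", "the spark on the lake"]
    print(word_count(sample))
```

The point of the sketch is the shape of the program: three short operator calls replace the separate mapper and reducer classes and the job configuration that a hand-written Java MapReduce application would need.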
Finally, many programs run up to 100 times faster than Hadoop batch MapReduce jobs when using in-memory processing and about 10 times faster on disk – similar to the results obtained with Tez. It is also worth mentioning that Spark is part of the next-generation Stinger project to improve Hadoop Hive SQL. Then again, if you believe the hype in the article mentioned in the first paragraph, you might not learn that Hadoop version 2 is no longer a one-trick pony and is now the preferred platform for new tools designed to take advantage of the growing data lake.
Other than trying it out, there is not much more to mention about Spark and Hadoop. Spark does seem a bit easier to use than writing a Java application with the Hadoop MapReduce APIs. The real advantage, however, is that the developer can decide which approach works best and then head out onto the data lake.
More to Come
Many people are surprised to learn the extent of Hadoop 2 application development, which includes many applications that were not possible with Hadoop version 1 and many examples of how to write an application that will run under Hadoop YARN. Much of the current confusion surrounding Hadoop exists because Hadoop has transitioned from an application to a far-reaching platform on which applications like Spark can be implemented. The value of an open Big Data platform cannot be overstated. Hadoop version 2 is the operating system for data lake clusters and a milestone in the evolution of data analysis.
Infos
[1] Jackson, J. "Hadoop successor sparks a data analysis evolution." IDG News Service, December 5, 2014: http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html
[2] Big Data characteristics: http://en.wikipedia.org/wiki/Big_data
[3] Big Data surprises: http://www.sisense.com/blog/big-data-surprises
[4] Rowstron, A., D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. "Nobody ever got fired for using Hadoop on a cluster." In: 1st International Workshop on Hot Topics in Cloud Data Processing (Bern, Switzerland, Association for Computing Machinery, 2012): http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
[5] Apache Tez: http://tez.apache.org/install.html
[6] Stinger project: http://hortonworks.com/labs/stinger/
[7] Apache Spark: https://spark.apache.org