Of lakes and sparks – How Hadoop 2 got it right
Apache Spark
Apache Spark [7] is an exciting project that extends what Hadoop MapReduce can do. First, Spark is a fast, general-purpose cluster computing system built around in-memory parallel processing. It provides high-level APIs in Java, Scala, and Python and an optimized engine that supports general execution graphs. It also supports a rich set of higher level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Second, Spark offers more than 80 high-level operators that make it easy to build parallel applications. It can be used interactively from the Scala and Python shells. Spark can be run in a standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
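To give a feel for what chaining these high-level operators looks like, the following is a minimal sketch of the classic word count, not code from the Spark project. The function names `word_count_local` and `word_count_spark` are made up for this illustration. The first function is a plain Python stand-in that runs anywhere; the second assumes a working PySpark installation and an existing `SparkContext` passed in as `sc`.

```python
from collections import defaultdict


def word_count_local(lines):
    """Pure-Python stand-in for the flatMap -> map -> reduceByKey chain."""
    # flatMap: split every line into words and flatten the result
    words = (word for line in lines for word in line.split())
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count_spark(sc, lines):
    """Same pipeline with Spark RDD operators.

    Assumption: sc is an existing SparkContext (hypothetical caller setup);
    nothing here runs unless PySpark is installed and sc is supplied.
    """
    return dict(
        sc.parallelize(lines)                # distribute the input lines
        .flatMap(lambda line: line.split())  # one record per word
        .map(lambda word: (word, 1))         # (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)     # sum counts per word
        .collect()                           # bring results back to the driver
    )


if __name__ == "__main__":
    sample = ["spark on yarn", "hadoop on yarn"]
    print(word_count_local(sample))
```

In the interactive Python shell that ships with Spark, a context is already available as `sc`, so calling `word_count_spark(sc, ["spark on yarn", "hadoop on yarn"])` there should produce the same counts as the local version.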
Finally, according to the Spark project, programs can run up to 100 times faster than Hadoop batch MapReduce jobs when using in-memory processing and about 10 times faster on disk, similar to the results obtained with Tez. It is also worth mentioning that Spark is part of the next-generation Stinger Project to improve Hadoop Hive SQL. Then again, if you believe the hype in the article mentioned in the first paragraph, you might never learn that Hadoop version 2 is no longer a one-trick pony and is now the preferred platform for new tools designed to take advantage of the growing data lake.
Other than trying it out, there is not much more to mention about Spark and Hadoop. Spark does seem a bit easier to use than writing a Java application with the Hadoop MapReduce APIs, but the real advantage of Hadoop 2 is that developers can decide which approach works best and then head out onto the data lake.
More to Come
Many people are surprised to learn the extent of Hadoop 2 application development, which includes many applications that were not possible with Hadoop version 1 and many examples of how to write an application that will run under Hadoop YARN. Much of the current confusion that surrounds Hadoop exists because Hadoop has transitioned from an application to a far-reaching platform on which applications like Spark can be implemented. The value of an open Big Data platform cannot be overstated. Hadoop version 2 is the operating system for data lake clusters and a milestone in the evolution of data analysis.
Infos
[1] Jackson, J. "Hadoop successor sparks a data analysis evolution." IDG News Service, December 5, 2014: http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html
[2] Big Data characteristics: http://en.wikipedia.org/wiki/Big_data
[3] Big Data surprises: http://www.sisense.com/blog/big-data-surprises
[4] Rowstron, A., D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. "Nobody ever got fired for using Hadoop on a cluster." In: 1st International Workshop on Hot Topics in Cloud Data Processing (Bern, Switzerland, Association for Computing Machinery, 2012): http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
[5] Apache Tez: http://tez.apache.org/install.html
[6] Stinger project: http://hortonworks.com/labs/stinger/
[7] Apache Spark: https://spark.apache.org