Apache Spark is Hadoop's speedy Swiss Army knife

Saturday 31 May 2014

Fast-running data analysis engine provides the real-time data processing capabilities Hadoop has been pushed to incorporate


Hadoop, the data processing framework that's become a platform unto itself, is only as good as the components that plug into it. But the conventional MapReduce component for Hadoop has a reputation for being too slow for real-time data analysis.
Enter Apache Spark, a Hadoop data processing engine designed for both batch and streaming workloads, now in its 1.0 incarnation and outfitted with features that exemplify what kinds of work Hadoop is being pushed to encompass.
Spark's libraries are designed to complement the types of processing jobs being explored more aggressively with the latest commercially supported deployments of Hadoop. MLlib implements a slew of common machine learning algorithms, such as naïve Bayesian classification or clustering; Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
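For a rough sense of what an MLlib job looks like in practice, the sketch below clusters a set of numeric feature vectors with k-means using Spark's Scala API. It is only illustrative: the input path, application name, and cluster count are placeholders rather than anything drawn from the article, and the calls shown follow the API documented for the 1.0 release.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder application name and local master, for illustration only.
        val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch").setMaster("local[*]"))

        // Parse whitespace-separated numbers into MLlib vectors; the path is hypothetical.
        val points = sc.textFile("hdfs:///data/points.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // Cluster the vectors into two groups, running at most 20 iterations.
        val model = KMeans.train(points, 2, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }

The same SparkContext can feed Spark Streaming or GraphX jobs, which is much of the library's appeal: one engine, several kinds of workloads.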
Another feature sported by Spark, the alpha-stage Spark SQL, allows SQL-like queries to be run against data stored in Apache Hive. Extracting data from Hadoop via SQL queries is yet another variant of the real-time querying functionality springing up around the platform. Everyone from Pivotal to Splice Machine now has offerings in this vein, although the implementations vary widely and aren't (yet) based on a common open source standard.
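As a minimal sketch of how that looks in code, the example below uses the HiveContext from Spark 1.0's Scala API to run a query against an existing Hive table. The table and column names are invented for illustration, and because Spark SQL is still in alpha the interface may change in later releases.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQuerySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQuerySketch"))

        // HiveContext picks up table definitions from an existing Hive metastore.
        val hive = new HiveContext(sc)

        // "weblogs" and its "page" column are hypothetical names, not from the article.
        val topPages = hive.hql(
          "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10")

        // The result behaves like an ordinary Spark RDD of rows, so it can be
        // collected here or handed off to further Spark transformations.
        topPages.collect().foreach(println)

        sc.stop()
      }
    }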
Hadoop's open-endedness is echoed further in how many of the above in-demand functions -- stream processing, machine learning, and so on -- are now addressed by multiple products. In the case of machine learning, for example, Apache's Mahout project is designed to be a far more scalable and robust processing engine for such jobs than MLlib alone.
What Spark has to offer is bound to be a big draw for both users and commercial vendors of Hadoop. Users who have made Hadoop a default repository for data of all kinds (albeit with caveats) and who have already built many of their analytics systems around it are attracted to the idea of being able to use Hadoop as a real-time processing system as well.
Hadoop vendors, too, should be drawn to Spark 1.0 because it provides them with another variety of functionality to support or build proprietary items around. In fact, one of the big three Hadoop vendors, Cloudera, has already been providing commercial support for Spark since earlier this year via its Cloudera Enterprise offering. Hortonworks has also been offering Spark as a component of its Hadoop distribution, though only as a technology preview. Where those companies go from here with Spark is likely to be dictated as aggressively by their users -- and Spark's developers -- as it is by their business plans for Hadoop.
