The Apache Software Foundation recently announced the first production-ready release of Spark for the Hadoop data-processing platform. Originally developed five years ago at the University of California, Berkeley's AMPLab (Algorithms, Machines and People Lab), Spark is an impressive open source in-memory processing engine built around speed, ease of use, and advanced analytics. It can be deployed with Hadoop or independently.

Spark is an alternative to MapReduce. Instead of running jobs as long batches, it runs them as bursts of short batches. Its key benefits come from reliably caching intermediate data in memory rather than writing it to disk after every step. Through its ecosystem components, Spark can enable:

  • SQL queries (Shark)
  • Streaming analytics (Spark Streaming)
  • Machine learning (MLlib)
  • Graph computation jobs (GraphX)
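To make the in-memory caching idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's API: `ToyDataset`, `read_source`, and the `reads` counter are hypothetical names invented for illustration. The sketch only shows why keeping an intermediate result in memory lets a second job skip the expensive read.

```python
class ToyDataset:
    """A toy stand-in for a cacheable dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the records
        self._cached = None        # in-memory copy, filled on first use
        self._use_cache = False

    def map(self, fn):
        # Lazily build a new dataset on top of this one.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Mark this dataset so its result is kept in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


reads = {"count": 0}  # how many times the "slow" source was read


def read_source():
    """Pretend to read records from slow storage, counting each read."""
    reads["count"] += 1
    return [1, 2, 3, 4]


squares = ToyDataset(read_source).map(lambda x: x * x).cache()

total = sum(squares.collect())     # first job: reads the source
largest = max(squares.collect())   # second job: served from memory

print(total, largest, reads["count"])  # 30 16 1
```

Without the `cache()` call, the second job would call `read_source` again and the counter would end at 2. In a real cluster that repeated work is a trip to disk or across the network, which is where the savings come from.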

So how are we seeing it deployed among industry leaders? Some examples of its real-time uses include:

  • Machine-generated data collection and analysis, especially where data has to be joined from multiple sources
  • Stream processing, such as log analysis of live streams of alerts
  • Social data analysis
  • Recommendation engines
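As a rough illustration of the "short batches" idea applied to log analysis, the plain-Python sketch below splits a stream of log lines into small batches and counts severity levels per batch. It is an assumption-laden toy, not Spark Streaming code: `micro_batches`, `count_levels`, and the sample log lines are made up for this example, and it batches by record count for simplicity, whereas a streaming engine would typically batch by time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive lists of at most batch_size records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_levels(batch):
    """Count log lines per severity, treating the first token as the level."""
    return Counter(line.split(" ", 1)[0] for line in batch)


# Hypothetical sample data standing in for a live stream of alerts.
log_stream = [
    "ERROR disk full",
    "INFO job started",
    "ERROR timeout",
    "WARN slow response",
    "INFO job finished",
]

# Each small batch is processed on its own as it "arrives".
per_batch = [count_levels(b) for b in micro_batches(log_stream, batch_size=2)]

# Per-batch results can be merged into a running total.
running_total = sum(per_batch, Counter())

print(len(per_batch), running_total["ERROR"], running_total["WARN"])  # 3 2 1
```

Because each batch is small, results are available shortly after the data arrives instead of at the end of one long job.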

With Spark, applications running in Hadoop clusters can run up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster on disk. Applications can be written in Java, Scala, or Python. You can read more about Apache Spark at Cloudera or Databricks.


Join us at Hadoop Summit 2014 this week to learn about some of the cool things Dell is doing! We're in Booth G14. And attend the Dell, Cloudera and Intel Fireside Chat, Thursday, June 5th, 12:35-1:20, to discuss Hadoop tuning and real-world benchmarking.