Apache Spark-A Brief Overview

On November 10


What is Apache Spark?

  1. A fast cluster-computing framework.
    • Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster even when running on disk.
    • Applications can be written in Java, Scala, or Python.

  2. A big data processing framework.
    • Provides a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse in nature. E.g., Google handles ~100 PB, eBay ~100 PB, and Facebook ~600 TB per day.
    • Spark is being adopted by major players like Amazon, eBay, and Yahoo!.
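The "write applications in Java, Scala, or Python" point refers to Spark's functional, collection-style API. As a conceptual sketch, the map/reduce-style word count that this API generalizes can be written in plain Python (an illustration of the programming model only, not actual Spark code; the function names here are hypothetical):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs -- analogous to Spark's flatMap + map steps.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sum the counts per key -- analogous to Spark's reduceByKey.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is fast", "spark runs in memory"]
counts = reduce_phase(map_phase(lines))
print(counts["spark"])  # 2
```

In Spark the same two steps are chained as transformations on a distributed dataset, and the engine handles partitioning and shuffling across the cluster.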

Hadoop and Spark

  1. Hadoop is one of the established big data processing technologies, in use for over 10 years.
  2. It uses a linear model to process data.
  3. It relies on the MapReduce cycle, with disk writes between stages, and is therefore slow.
  4. Spark provides in-memory data processing across the cluster.
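The difference between points 3 and 4 can be sketched in plain Python (a toy simulation, not Hadoop or Spark code): a MapReduce-style job re-reads its input from storage on every pass, while a Spark-style job reads once, caches the data in memory, and reuses it.

```python
class Disk:
    """Simulated storage that counts how many full scans are performed."""
    def __init__(self, records):
        self.records = records
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self.records)

def hadoop_style(disk, passes):
    # MapReduce-style: every pass pays the disk-read cost again.
    return [sum(disk.scan()) for _ in range(passes)]

def spark_style(disk, passes):
    # Spark-style: read once, hold the dataset in memory, reuse it.
    cached = disk.scan()
    return [sum(cached) for _ in range(passes)]

d1, d2 = Disk([1, 2, 3, 4]), Disk([1, 2, 3, 4])
hadoop_style(d1, 3)
spark_style(d2, 3)
print(d1.reads, d2.reads)  # 3 1
```

This is why the speedup is largest for iterative workloads (e.g., machine learning) that revisit the same dataset many times.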



Spark Features

  1. Spark holds intermediate results in memory rather than writing them to disk, which is very useful when you need to work on the same dataset multiple times.
  2. Spark operators fall back to external (disk-based) operations when data does not fit in memory.
  3. It provides high-level APIs in Scala, Java, and Python to improve developer productivity, along with a consistent architecture model for big data processing.
  4. It is, to date, the fastest open-source engine for sorting a petabyte.
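Feature 2 relies on the classic external-sort pattern: sort data in memory-sized chunks, spill each sorted run to disk, then merge the runs. A minimal pure-Python sketch of that idea, with an artificially tiny "memory" budget (an illustration of the technique, not Spark's internals):

```python
import heapq
import tempfile

def external_sort(values, max_in_memory=3):
    """Sort values while holding at most max_in_memory items in RAM,
    spilling sorted runs to temporary files on disk."""
    runs = []
    for i in range(0, len(values), max_in_memory):
        chunk = sorted(values[i:i + max_in_memory])   # sort one chunk in memory
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in chunk)          # spill the run to disk
        f.seek(0)
        runs.append(f)
    # k-way merge of the sorted on-disk runs back into one sorted stream
    result = list(heapq.merge(*[(int(line) for line in f) for f in runs]))
    for f in runs:
        f.close()
    return result

print(external_sort([5, 1, 4, 2, 8, 7, 3]))  # [1, 2, 3, 4, 5, 7, 8]
```

The petabyte-sort result in feature 4 depends on exactly this kind of graceful degradation: the engine stays fast in memory but keeps working correctly when the data is far larger than RAM.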

Spark at Yahoo

  1. Yahoo has two Spark projects in the works: one for personalizing news pages for Web visitors, and another for running analytics for advertising.
  2. To achieve this, Yahoo (a major contributor to Apache Spark) wrote a Spark ML (machine learning) algorithm in Scala.
  3. With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.


