Introduction to Apache Spark

Why do we need Spark?

  1. Hadoop MapReduce is limited to Batch Processing.
  2. Apache Storm/S4 is limited to real-time Stream Processing.
  3. Apache Impala/Tez is limited to Interactive Processing.
  4. Neo4j/Apache Giraph is limited to Graph Processing.

Hence, there was no single powerful engine in the industry that could process data both in real time (streaming) and in batch mode. There was also a need for an engine that could respond in under a second and perform in-memory analytics.

This is where Apache Spark comes in. It is a powerful open-source engine that provides real-time streaming, interactive, graph, in-memory, and batch processing, with speed, ease of use, and sophisticated analytics.

What is Apache Spark?

Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs. Spark is written in Scala, but it also provides APIs for Java, Python, and R. For some workloads, Spark can run up to 100 times faster than Hadoop MapReduce when the data is processed in memory, and up to 10 times faster when the data is read from disk.

Spark Features 

  1. Speed.
  2. Ease of use.
  3. Low latency.
  4. Integration with Hadoop.
  5. Rich set of operators.
  6. Fault tolerant.
  7. Generalised execution model.

Apache Spark Architecture

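Spark is an open-source distributed framework with a simple architecture built around two kinds of nodes: a master node and worker nodes. The architecture has three parts: the driver program (with its SparkContext) on the master, the cluster manager in the middle, and the executors on the worker nodes.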


Every Spark application requires a SparkContext. It is the main entry point for a Spark application. It interacts with the cluster manager and tells Spark how to access the cluster. RDDs are also created using the SparkContext.

The actual execution happens on the worker nodes, in units called tasks. In distributed computing, a job is split into stages, and each stage is split into tasks, typically one per data partition. In Spark, each worker node runs one or more executors, which are JVM processes that carry out these tasks.

In the middle sits the cluster manager. The cluster manager handles the nodes in the cluster: it allocates resources to applications and schedules their work across the nodes. Spark is flexible here as well. It can run on YARN (the native Hadoop cluster manager), on Apache Mesos, or on its own standalone cluster manager. Spark can also run in local mode.

RDD Operations

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Resilient: if data is lost, it is recreated automatically (fault tolerant).

Distributed: the data resides on multiple nodes in a cluster.

Dataset: a collection of partitioned data.
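The three properties above can be sketched with a small toy model in plain Python. This is not the Spark API: `split_into_partitions` and `PartitionedDataset` are hypothetical names made up for illustration, and everything runs in a single process. It only shows the idea that data is split into partitions, and that a lost partition can be rebuilt from the unchanged source plus the function that produced it.

```python
from typing import Callable, Dict, List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


class PartitionedDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, source: List[List[int]], fn: Callable[[int], int]):
        self._source = source          # input partitions, never modified
        self._fn = fn                  # how each element is derived
        self._computed: Dict[int, List[int]] = {}

    def partition(self, index: int) -> List[int]:
        # Compute on demand; a lost partition is simply computed again.
        if index not in self._computed:
            self._computed[index] = [self._fn(x) for x in self._source[index]]
        return self._computed[index]

    def lose(self, index: int) -> None:
        # Simulate a node failure that drops one computed partition.
        self._computed.pop(index, None)


parts = split_into_partitions(list(range(10)), 3)
doubled = PartitionedDataset(parts, lambda x: x * 2)
before = doubled.partition(1)
doubled.lose(1)
after = doubled.partition(1)   # rebuilt from the source partition
print(parts, before, after)
```

In a real cluster the partitions would live on different worker nodes; here the list of lists only stands in for that layout.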


Why do we need RDDs? The answer is simple: data sharing is slow in MapReduce.

Let’s look at this in more detail:

  • In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g., HDFS).
  • Data sharing is slow in MapReduce due to replication, serialization, and disk I/O.
  • As a result, most Hadoop applications spend more than 90% of their time on HDFS read and write operations.


Figure: Iterative Operations on MapReduce

This is where RDDs come to the rescue:

  • To address this problem, Resilient Distributed Datasets (RDDs) support in-memory computation.
  • This means intermediate state is kept in memory as an object across jobs, and that object can be shared between those jobs.
  • Data sharing in memory is 10 to 100 times faster than going over the network or to disk.

Figure: Iterative Operations on RDD

There are two types of RDD operations:

  1. Transformations and
  2. Actions

Transformations: A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Transformations are lazy: they are not executed immediately, but only when an action is called.

Actions: Actions are operations that return a result to the driver program or write it to storage.

Some of the actions in Spark include:

  • reduce
  • collect
  • count
  • first
  • take
  • takeSample
  • countByKey
  • saveAsTextFile
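The difference between lazy transformations and actions can be illustrated with a minimal sketch in plain Python. `LazyDataset` is a hypothetical toy class, not Spark's RDD: its `map` and `filter` only record what should be done, and nothing runs until an action such as `collect`, `count`, `take`, or `reduce` is called. The method names mirror some of the actions listed above, but the implementation is an assumption made purely for illustration.

```python
import functools
import itertools
import operator


class LazyDataset:
    """Toy model of lazy transformations and eager actions (not Spark)."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = steps            # recorded transformations

    # Transformations: return a new dataset and run nothing yet.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Apply the recorded steps, in order, to each element.
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))

    def reduce(self, fn):
        return functools.reduce(fn, self._run())


calls = []


def square(x):
    calls.append(x)                    # record every real invocation
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)      # still empty: nothing has run
result = odd_squares.collect()         # [1, 9, 25]
total = odd_squares.reduce(operator.add)
print(calls_before_action, result, total)
```

Note that each action in this sketch re-runs the whole chain from the source data, which is why keeping intermediate results in memory matters for iterative workloads.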

Spark Ecosystem 

Spark only does computation; it does not store data itself. Data can be stored in Hadoop, HBase, Cassandra, S3, Azure Blob Storage, etc.

Spark Core is the execution engine for the Spark platform. It provides a generalised execution model that supports a wide variety of applications.

Spark SQL enables users to submit SQL/HQL queries on top of Spark. It provides an engine for Hive data that lets unmodified Hive queries run up to 100x faster.

Spark Streaming represents streaming data using discretised streams (DStreams), which periodically create RDDs containing the data that came in during the last time window.
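The idea of cutting a stream into per-window batches can be sketched in plain Python. This is a toy model, not the Spark Streaming API: `discretize` is a hypothetical helper, the events are made-up sample data, and each batch is an ordinary list standing in for the RDD that a DStream would create for that window.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed time windows.

    Returns a list of (window_start, values) pairs sorted by window start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // batch_interval) * batch_interval
        batches[window_start].append(value)
    return sorted(batches.items())


# Made-up sample events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

for window_start, values in discretize(events, batch_interval=1):
    # Each batch could now be processed like any other small dataset.
    print(window_start, values)
```

With a one-second interval, the five events fall into three batches, one per second in which data arrived.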

Spark MLlib handles machine-learning models used for transforming datasets, which are represented as RDDs or DataFrames.

Spark GraphX is a graph computation engine built on top of Spark that enables users to process graph data at scale.

