Sunday, July 3, 2016

Spark Fundamentals, Spark Core, Spark History, Spark RDD

Why do we require Spark?
1. Data that sits on one machine is very difficult to process, and it keeps increasing day by day
1. Easy Readability
2. Expressiveness
3. Fast
4. Testability
5. Interactive
6. Fault Tolerant
7. Unify Big Data
Spark overview
1. Basics of Spark
2. Core API
3. Cluster Managers
4. Spark Maintenance
1. SQL
2. Streaming
3. MLlib/GraphX
Basics of Spark
1. Hadoop
2. History of Spark
3. Installation
4. Big Data’s Hello World

If we want to process streaming data, we need Storm. In the same way, we need a different framework for each kind of big data work, such as Hive, Scalding, HBase, Apache Drill, Flume, Mahout and Apache Giraph. Spark came into the picture to unify all of these.

1. Spark is a unified platform for Big Data
2. It originates from core libraries

Abstractions FTW
Hadoop MR – 110000 lines of code
Impala – 90000 lines of code
Storm – 70000 lines of code
Giraph – 60000 lines of code
Spark, all together – 80000 lines of code (Spark Core 40000 + Spark SQL 30000 + Streaming 6000 + GraphX 4000)

History of Spark
Hadoop – 2006
Spark – 2009
Spark paper – BSD Open Source – 2010
AMPLab – 2011
Databricks – 2013
Given to Apache – 2013
Top-level Apache project – 2014
Databricks == stability
They have a release every three months.

Who is using Spark
Over 500 companies are using Spark,
like Pandora, Netflix, Ooyala, Goldman Sachs, eBay, Yahoo, Conviva, and HHMI and Janelia for healthcare
Spark Installation

Check at

Spark Languages
We can use more than one language to write Spark applications
1. Scala
2. Java
3. Python
4. R

Hello Big Data
Word count example in
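
A minimal sketch of the usual word count, assuming Spark's Scala API is on the classpath and a local master (local[*]) is good enough. The object name WordCountSketch, the helper wordCounts and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Splits each line into words and counts how often each word occurs.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)                  // in-memory data as an RDD[String]
      .flatMap(line => line.split("\\s+")) // one record per word
      .filter(_.nonEmpty)
      .map(word => (word, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)                  // add up the counts per word
      .collect()                           // action: bring the result to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM; an assumption for the sketch.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try wordCounts(sc, Seq("hello big data", "hello spark")).foreach(println)
    finally sc.stop()
  }
}
```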

Big Data
1. IoT (Internet of Things) – a fairly large amount of data.
2. Spark is a unified data platform.
Spark Logistics
Developer API
Alpha Component
Unit testing is very easy in Spark

1. AMPLab - for big data, Moore's law means better decisions
2. Chris Stucchio - hadoop_hatred
3. aadrake - command-line tools can be 235x faster than your Hadoop cluster
4. Quantified - Spark unit test
Apache Spark YouTube channel.

Spark Core

Spark Maintainers
1. Matei Zaharia
2. Reynold Xin
3. Patrick Wendell
4. Josh Rosen

Core API
1. Appify
2. RDD( Resilient Distributed Dataset)
3. Transforming the data
4. Action

Spark Mechanics
1. Driver - Spark Context (it distributes the work across the workers)

1. Executor - Task
2. Executor - Task
3. Executor - Task

Spark Context
a. Task creator
b. Scheduler
c. Data locality
d. Fault Tolerance

Resilient Distributed Dataset
DAG - Apache Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing

Transformations
a. Map
b. Filter

Actions
a. Collect
b. Count
c. Reduce
An RDD is immutable. Once created, we can’t change it.
Every action is a fresh submit: each action call submits a new job.
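
The difference between the two groups of methods can be sketched as follows. This is only an illustration, assuming Spark's Scala API and a local master; the object name TransformActionSketch and the numbers are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformActionSketch {
  // Returns (the doubled even numbers, how many there are, their sum).
  def run(sc: SparkContext): (List[Int], Long, Int) = {
    val numbers = sc.parallelize(1 to 10)

    // Transformations: each one returns a new RDD and computes nothing yet.
    // The original RDD is never modified.
    val evens   = numbers.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Actions: each call submits a job and returns a plain value to the driver.
    val all   = doubled.collect().toList
    val n     = doubled.count()
    val total = doubled.reduce(_ + _)
    (all, n, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transform-action-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

Because the three actions are separate submits, the filter and map steps are re-run for each of them unless the RDD is cached.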

Input – How to load the data
1. Hadoop HDFS
2. File System
3. Amazon S3
4. Databases
5. Cassandra
6. In Memory
7. Avro
8. Parquet
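
Two of these inputs, in-memory data and the file system, can be loaded with calls on the Spark Context. This is a rough sketch assuming Spark's Scala API; the object name LoadDataSketch and its helpers are hypothetical. The other sources in the list generally need their own URI scheme or connector library, which is not shown here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadDataSketch {
  // In memory: turn a local collection into an RDD and count its records.
  def fromMemory(sc: SparkContext, items: Seq[String]): Long =
    sc.parallelize(items).count()

  // File system: textFile gives one record per line of the file.
  def fromFile(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("load-data-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(fromMemory(sc, Seq("a", "b", "c")))
      // The file path is expected as the first program argument.
      args.headOption.foreach(path => println(fromFile(sc, path)))
    } finally sc.stop()
  }
}
```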
Lambdas - Anonymous functions
Named Method
def addOne(item: Int) = {
  item + 1
}
val intList = List(1, 2)
for (item <- intList) yield {
  addOne(item)
} // List(2, 3)

Using lambda function
val intList = List(1, 2)
intList.map(item => {
  item + 1
}) // List(2, 3)

We can minimise the above code like below
val intList = List(1, 2)
intList.map(item => item + 1) // List(2, 3)

An RDD has 2 types of methods
1. Transformations
a. A transformation takes our existing data set, runs it through the provided function and transforms it into another required shape.
b. If a method returns another RDD, then it is a transformation.

Map – the data is distributed across nodes like Node 1, Node 2 and Node N
The given function will execute on all the nodes
      for (item <- items) yield {
           mapFunction(item)
      }
mapFunction is the transformation function. It is repeated across all the nodes. Instead of repeating the same setup for every item on all the nodes, we can avoid it by using mapItemsFunction(items), which handles a whole batch of items at once. Ex: instead of creating a DB connection for every item, each node will have a single DB connection.

RDD Combiners
We may have a MongoDB RDD1 and an HDFS file system RDD2. To combine both RDDs we can use union to get a combinedRDD.
We can also use the ++ operator to combine two RDDs.
Intersection – RDD1.intersection(RDD2) gets the distinct values that are common to the 2 RDDs.
Subtract – keeps only the values of one RDD that are not in the other RDD.
Cartesian – each element of one RDD is paired with every element of the other RDD, giving all possible pairs.
Zip – both RDDs should have the same number of elements and the same number of partitions.

2. Actions
a. Transformations are lazy and keep the data as distributed as possible.
b. Actions typically send results back to the Driver.

1. Associative Property
2+4+4+7: we can add these values in one go or as (2+4) + (4+7).
In other words, however we group the operation, the result should be the same.

Acting on Data
Data is distributed across the cluster. If we collect all the data and send it to the Driver, there may be out of memory exceptions. Instead of that we can use take(5): only 5 records are moved to the Driver, and the Driver keeps them in Array format for the final computation.

Persistence
Saved data does not need to go to the Driver. It can be stored directly into any DB like
1. Cassandra
2. MongoDB
3. Hadoop HDFS
4. Amazon Redshift
5. MySQL
To save the data we can use different formats
1. saveAsObjectFile(path)
2. saveAsTextFile(path)
3. ExternalConnector
4. foreach(T => Unit)
foreachPartition(Iterator[T] => Unit)
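
The combiners, actions and save calls described above can be sketched in one place. This is an illustration under stated assumptions, not a definitive implementation: it assumes Spark's Scala API and a local master, the names RddCombinersSketch, Result and FakeConnection are hypothetical, and FakeConnection only stands in for a real database client. The "one connection per batch of items" idea is shown with Spark's mapPartitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCombinersSketch {
  // Hypothetical stand-in for a real database client.
  class FakeConnection { def lookup(id: Int): String = s"row-$id" }

  case class Result(
    union: List[Int], common: List[Int], onlyInFirst: List[Int],
    pairCount: Long, zipped: List[(Int, String)],
    firstTwo: List[Int], sum: Int)

  def run(sc: SparkContext): Result = {
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 4), 2)
    val rdd2 = sc.parallelize(Seq(3, 4, 5, 6), 2)

    // One connection per partition instead of one per item.
    val rows = rdd1.mapPartitions { items =>
      val conn = new FakeConnection
      items.map(id => conn.lookup(id))
    }

    Result(
      union       = (rdd1 ++ rdd2).collect().toList.sorted,         // keeps duplicates
      common      = rdd1.intersection(rdd2).collect().toList.sorted,
      onlyInFirst = rdd1.subtract(rdd2).collect().toList.sorted,
      pairCount   = rdd1.cartesian(rdd2).count(),                    // 4 x 4 pairs
      zipped      = rdd1.zip(rows).collect().toList,                 // same partitioning
      firstTwo    = rdd1.take(2).toList,                             // only 2 records reach the driver
      sum         = rdd1.reduce(_ + _))                              // associative function
  }

  // Executors write their partitions directly; outDir must not exist yet.
  def save(sc: SparkContext, outDir: String): Unit =
    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile(outDir)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-combiners-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

The results are sorted before they are returned only to make them easy to compare; Spark does not promise an ordering for intersection or subtract.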