Apache Spark Data Structures - Data Structures Designed for Lightning-Fast Cluster Computing
The Apache Spark framework is designed and developed to provide enterprise-grade distributed processing of large data sets over a lightning-fast Spark cluster. The design of Apache Spark's data structures plays a major role in the fast and efficient processing of very large data sets. In this tutorial we are going to understand all the data structures of the Apache Spark framework; understanding them will help you write better and more efficient programs for your clients.
Data structures play a very important role in programming, and each programming language comes with data structures specific to it. A data structure is a format for data organization, management and storage that provides functions and operations for efficient processing of data. The popular data structures in most programming languages are arrays, collections, files, trees, lists and many more.
Data Structures in Apache Spark Framework
The first version of the Spark framework was released with the RDD data structure, and nowadays it supports the following data structures:
- RDD - Spark RDD (Resilient Distributed Dataset) is a distributed collection of data spread over the nodes in the cluster.
- Dataframe - The Dataframe is the next data structure in Spark. It is again a distributed collection of data, but here the data is organized into named columns. This data structure is used with SparkSQL and provides SQL-like operations over the data.
- Dataset - The Dataset is a strongly typed, relational, column-based data structure in Spark, also used with SparkSQL.
- TUNGSTEN - Tungsten is a new binary in-memory data representation introduced with Spark 2.x (SparkSQL) to provide high-performance processing of data. It is a product of Project Tungsten, which runs in parallel with Spark 2.x development. Tungsten works at the bytecode level, optimizing Spark jobs for CPU and memory efficiency.
- Graphframe - GraphFrames is a DataFrame-based package for Apache Spark that is used for graph processing. GraphFrames is available for the Java, Scala and Python programming languages.
All these data structures make Apache Spark a very powerful framework for processing data in Big Data environments. Let's see simple examples of these data structures in Spark Scala.
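All of the code snippets in this tutorial assume a running SparkSession named spark (and its SparkContext sc). A minimal sketch for creating one in a standalone Scala application could look like the following; the application name and the local master URL are just placeholders for this tutorial:
import org.apache.spark.sql.SparkSession
// Create the SparkSession, the entry point for the DataFrame/Dataset APIs
val spark = SparkSession.builder()
  .appName("SparkDataStructuresDemo")  // placeholder application name
  .master("local[*]")                  // run locally on all cores; use your cluster URL in production
  .getOrCreate()
// The SparkContext is used for the RDD examples below
val sc = spark.sparkContext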
RDD - Spark RDD (Resilient Distributed Dataset)
The Apache Spark framework started out with the RDD data structure, and it now provides the five highly efficient data structures listed above. RDD stands for Resilient Distributed Dataset, a data set distributed over the nodes in the Spark cluster. Here are the features of RDD:
- Resilient: The Apache Spark RDD is resilient; in case of failure of one or more nodes, the lost partitions can be recomputed.
- Distributed: The RDD is a distributed dataset that can be partitioned over several nodes in the cluster.
- Immutable: Once an RDD is created it can't be modified. If some modification is required, a transformation can be applied to generate a new RDD.
- Lazy: An Apache Spark RDD just represents the data that results from a computation. It does not trigger any computation by itself; instead, transformations and actions are applied to process the RDD data, and transformations generate new RDDs.
- Statically typed: An Apache Spark RDD is statically typed and has a type, for example RDD[String] or RDD[Location].
RDD data sets support two types of operations:
- Transformations
- Actions
Now let's look at these one by one.
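Before going through the full list, here is a minimal sketch (using the SparkContext sc created above) showing the difference: a transformation such as map() only describes a computation, while an action such as count() actually triggers it.
val numbers = sc.parallelize(1 to 10)    // create an RDD from a local collection
val squares = numbers.map(n => n * n)    // transformation: nothing is executed yet
println(squares.count())                 // action: triggers the job and prints 10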
RDD supports the following transformations (a combined example is shown after this list):
- Map - The map() function is used to apply a function to every element of the RDD, producing a new RDD.
- FlatMap - The flatMap() function is mostly used in split operations; for each input element we can get zero or more output elements in the output RDD.
- MapPartition - The mapPartitions() function is like map(); the difference is that mapPartitions() runs separately on each partition (block) of the RDD.
- Filter - The filter() function is used to filter the input RDD into a new RDD based on certain criteria.
- Sample - The sample() method is used to take a sample of the data from an RDD into a new RDD. The takeSample() method is also available for taking a sample from the RDD data as a local collection.
- Union - The union() operation is used to merge the data of one RDD with another RDD into a new RDD. This is very useful if you have two or more RDDs and want to merge their data into a new RDD. You can easily do it with rdd1.union(rdd2).
- Intersection - The intersection() method is used to get the common elements of two RDDs into a new RDD. This is used when you want to find the elements that appear in both RDDs. You can find the intersection with this code: rdd1.intersection(rdd2).
- Distinct - The distinct() method is used to get the distinct elements from the source RDD.
- ReduceByKey - The reduceByKey() method is used to aggregate the data separately for each key in a pair RDD, merging the values of each key with the supplied function. This method can easily be used for word count in Apache Spark. The reduceByKey() operation is lazily evaluated. Here is example code:
val words = Array("one", "Sixteen", "two", "nine", "five", "nine", "Sixteen", "four", "nine", "Sixteen", "four")
val data = spark.sparkContext.parallelize(words).map(w => (w, 1)).reduceByKey(_ + _)
data.foreach(println)
- GroupByKey - The groupByKey() function of RDD groups all the values in the data set by their key.
- Join - The join() operation in Spark is used to combine data from two RDDs into one RDD on the basis of the key.
- Cartesian - The cartesian() function is used to get the Cartesian product of two RDDs as another RDD.
- Repartition - The Apache Spark framework splits the data into a number of partitions and then executes computations in parallel over the distributed cluster. The repartition() method can be used to increase or decrease the number of partitions for better performance of the application. You can call repartition() on the RDD and provide the number of partitions as a parameter, e.g. val rdd2 = rdd1.repartition(5).
- Coalesce - The coalesce() function is used to reduce the number of partitions of an RDD without a full shuffle of the data. So, if we have to reduce the number of partitions, we should use coalesce() to avoid a full re-shuffle of data.
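Here is a small combined sketch of several of the transformations listed above, assuming the SparkContext sc from the setup section; the sample words are made up for illustration:
val rdd1 = sc.parallelize(Seq("spark", "scala", "spark", "hadoop"))
val rdd2 = sc.parallelize(Seq("spark", "flink"))
val upper   = rdd1.map(_.toUpperCase)         // map: transform each element
val letters = rdd1.flatMap(_.split(""))       // flatMap: many output elements per input element
val sWords  = rdd1.filter(_.startsWith("s"))  // filter: keep only matching elements
val merged  = rdd1.union(rdd2)                // union: merge two RDDs
val common  = rdd1.intersection(rdd2)         // intersection: elements present in both RDDs
val unique  = merged.distinct()               // distinct: remove duplicate elements
val wider   = unique.repartition(4)           // repartition: change the number of partitions
val fewer   = wider.coalesce(2)               // coalesce: reduce partitions without a full shuffle
fewer.collect().foreach(println)              // collect() is an action, covered in the next list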
RDD supports the following actions (a short combined example is shown after this list):
- count - This function is used to count the number of elements in the RDD.
- collect - The collect() method is used to bring the entire RDD to the driver program. If the data size is huge, it is advisable to use it with care, because it retrieves all the RDD data onto the driver node.
- take - The take() function returns the first n elements of the RDD to the driver machine. If you want to return 10 elements, use take(10).
- top - The top() method returns the top n elements from the RDD using the default ordering.
- countByValue - The countByValue() method is used to get the count of each unique value in the RDD as a local Map.
- reduce - The reduce() method in Spark RDD is an action which performs aggregation on the RDD. The reduce() function triggers execution of the DAG, and the operation is performed on the full RDD over the distributed cluster.
- fold - The fold() function is like the reduce() function, but there is a little difference: reduce() throws an exception in case of an empty collection, while fold() does not, because it takes a zero value.
- aggregate - The aggregate() function is used when the result type is different from the element type of the RDD, and this gives flexibility to developers.
- foreach - The foreach() function is used to execute some operation on each data item in the RDD.
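A short sketch that exercises several of these actions on one small RDD, again assuming the SparkContext sc from the setup section; the numbers are made up for illustration:
val data = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))
println(data.count())                 // number of elements: 8
println(data.collect().mkString(",")) // bring the whole RDD to the driver (use with care)
println(data.take(3).mkString(","))   // first 3 elements
println(data.top(2).mkString(","))    // 2 largest elements using the default ordering
println(data.countByValue())          // local Map of value -> occurrence count
println(data.reduce(_ + _))           // aggregate all elements: 31
println(data.fold(0)(_ + _))          // like reduce, but with a zero value for empty RDDs
data.foreach(x => println(x))         // runs on the executors, not on the driver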
As we can see from the above details, the Spark RDD data set is very powerful, and it provides many ways to process data efficiently on the Spark cluster.
Dataframe
The Dataframe is the second data structure added to the Apache Spark framework, and it is a columnar data structure. In a Dataframe the data is organized into columns and rows. It runs on the SparkSQL context and provides SQL operations.
The Dataframe is not typed, and its main disadvantage is this lack of type safety. Dataframes support operations like printSchema(), select(), filter(), groupBy() etc. In future lessons you will find many examples of Dataframes in the Spark framework.
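A minimal Dataframe sketch, assuming the SparkSession spark from the setup section; the column names and sample rows are made up for illustration:
import spark.implicits._
val df = Seq(("Alice", 34), ("Bob", 45), ("Cathy", 29)).toDF("name", "age")
df.printSchema()                  // print the inferred schema
df.select("name").show()          // column projection
df.filter($"age" > 30).show()     // row filtering
df.groupBy("age").count().show()  // simple aggregation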
Dataset
The Dataset is just like the Dataframe but with full type safety. In a Dataset the data is strongly typed, which solves the type-safety issue faced with Dataframes. The Dataset also runs in the Apache SparkSQL context and provides SQL query features.
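A minimal Dataset sketch showing the compile-time type safety, again assuming the SparkSession spark from the setup section; the Person case class is made up for illustration:
import spark.implicits._
case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
// Fields are checked at compile time: p.age is known to be an Int
val adults = ds.filter(p => p.age > 30)
adults.show()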
TUNGSTEN
The Tungsten data structure was introduced in Apache Spark 2.x (SparkSQL). It is designed to improve the performance of Spark applications to approach bare-metal performance. Project Tungsten works towards improving the efficiency of memory and CPU usage for applications developed in Spark. The Tungsten initiative focuses on the following three areas:
- Memory management and binary processing, to overcome the overhead of the JVM.
- Cache-aware computation: algorithms and data structures developed for fast processing.
- Code generation: runtime code generation to take full advantage of modern compilers and CPUs.
In this way Project Tungsten is working towards a system for fast processing of data in Apache Spark.
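Tungsten is not used directly through an API, but its whole-stage code generation can be observed in a query plan. A minimal sketch, assuming the SparkSession spark from the setup section; in the printed physical plan the operators prefixed with * are the ones compiled by Tungsten's code generation:
val df = spark.range(0, 1000000).selectExpr("id", "id * 2 AS doubled")
df.explain()   // the physical plan shows the WholeStageCodegen stages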
Graphframe
The Graphframe is an Apache Spark library for processing large-scale graph data on a distributed Spark cluster. The API is available for the Java, Scala and Python programming languages. GraphFrames builds on DataFrames and provides graph functionality similar to the GraphX module of Apache Spark, making it possible to run graph queries on large data sets. Data from a source can be loaded into a GraphFrame, and then various queries can be performed on the graph data set. Since the processing is done in memory and over the distributed cluster, it provides very fast performance.
In Spark a GraphFrame can be initialized by passing the vertices and edges data to the GraphFrame() constructor, as shown below:
val gf = GraphFrame(vertices, edges)
Once the GraphFrame is initialized you can run various graph queries on the data set.
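Here is a small GraphFrames sketch in Scala. It assumes the external graphframes package is on the classpath (for example added through the --packages option of spark-submit) and the SparkSession spark from the setup section; the vertex and edge data are made up for illustration:
import org.graphframes.GraphFrame
import spark.implicits._
// GraphFrames expects an "id" column for vertices and "src"/"dst" columns for edges
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Cathy")).toDF("id", "name")
val edges    = Seq(("a", "b", "friend"), ("b", "c", "follow")).toDF("src", "dst", "relationship")
val gf = GraphFrame(vertices, edges)
gf.vertices.show()   // the vertex DataFrame
gf.edges.show()      // the edge DataFrame
gf.inDegrees.show()  // number of incoming edges per vertex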
In this section we have learned about all the data structures provided by the Apache Spark framework. This lesson presented a detailed introduction to the data structures in Apache Spark and a brief comparison between them.