Apache Spark data structures

In this section I am going to explain the data structures that Apache Spark uses for fast, distributed processing of data over a cluster. Learning the Apache Spark data structures gives you a good understanding of the Spark programming model.


Apache Spark data structures - Data Structures designed for Lightning-Fast Cluster Computing

The Apache Spark framework is designed and developed to provide enterprise-grade distributed processing of large data sets over a fast Spark cluster. The design of the Apache Spark data structures plays a major role in the fast and efficient processing of very large data sets. In this tutorial we are going to look at all the data structures of the Apache Spark framework. Understanding them will help you write better and more efficient programs for your clients.

Data structures play a very important role in programming, and each programming language comes with data structures specific to that language. A data structure in any programming language is a format for data organization, management and storage which provides functions and operations for efficient processing of data. Popular data structures in most programming languages are arrays, collections, files, trees, lists and many more.

Data Structures in Apache Spark Framework

The first version of the Spark framework was released with only the RDD data structure; nowadays it supports the following data structures:

  1. RDD - A Spark RDD (Resilient Distributed Dataset) is a distributed collection of data spread over the nodes in the cluster.
     
  2. Dataframe - The DataFrame is the next data structure in Spark. It is again a distributed collection of data, but here the data is organized into named columns. This data structure is used with SparkSQL and provides SQL-like operations over the data.
     
  3. Dataset - The Dataset is a strongly typed, relational, column-based data structure in Spark used with SparkSQL.
     
  4. TUNGSTEN - Tungsten is a new data structure introduced with Spark 2.x (SparkSQL) to provide high-performance processing of data. It is the product of Project Tungsten, which runs in parallel with the development of Spark 2.x. Tungsten works at the bytecode and binary level, optimizing Spark jobs for CPU and memory efficiency.
     
  5. Graphframe - GraphFrames is a DataFrame-based package for Apache Spark which is used for graph processing. GraphFrames are available for the Java, Scala and Python programming languages.

All these data structures make Apache Spark a very powerful framework for processing data in a Big Data environment. Let's see simple examples of these data structures in Spark Scala.


RDD - Spark RDD (Resilient Distributed Dataset)

The Apache Spark framework started with the RDD data structure and now provides five highly efficient data structures. RDD stands for Resilient Distributed Dataset, which is a data set distributed over the nodes of the Spark cluster. Here are the features of RDD (a short sketch illustrating them follows the list):

  • Resilient: Apache Spark RDD is resilient and in case of failure of one or more nodes it can be recomputed.
     
  • Distributed: The RDD is a distributed dataset which can be partitioned over several nodes in the cluster.
     
  • Immutable: Once an RDD is created it can't be modified. If some modification is required, a transformation can be applied to generate a new RDD.
     
  • Lazy: A Spark RDD just represents the data that results from a computation. It does not trigger any computation itself; instead, transformations and actions are applied to process the RDD data, which generates new RDDs.
     
  • Statically typed: A Spark RDD is statically typed and has an element type, for example RDD[String] or RDD[Location].
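
A minimal sketch illustrating these points, assuming a SparkSession named spark is already available (as in the later examples):

    // parallelize() turns a local collection into an RDD; the result is statically typed as RDD[String]
    val cities = spark.sparkContext.parallelize(Seq("Delhi", "Pune", "Mumbai", "Delhi"))

    // map() is lazy: this only records the transformation, nothing is computed yet
    val upper = cities.map(_.toUpperCase)

    // collect() is an action, so the computation is actually triggered here
    upper.collect().foreach(println)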

RDD datasets support two types of operations:

  • Transformations
  • Actions

Now let's look at these one by one.

RDD supports the following transformations (a short combined sketch follows the list):

  • Map - The map() function applies a given function to every element of the RDD and returns a new RDD with the results.
     
  • FlatMap - This is mostly used for split operations; for each input element we can get many output elements in the resulting RDD.
     
  • MapPartitions - The mapPartitions() function is like map(); the difference is that mapPartitions() runs separately on each partition (block) of the RDD.
     
  • Filter - This is used to filter the input RDD into a new RDD based on certain criteria.
     
  • Sample - The sample() method is used to take sample data from an RDD into a new RDD. The takeSample() method is also available for taking samples from RDD data sets.
     
  • Union - The union() operation is used to merge the data of one RDD with another RDD into a new RDD. This is very useful if you have two or more RDDs and want to merge their data into a new RDD. You can easily do it with rdd1.union(rdd2).
       
  • Intersection - The intersection() method is used to get the common elements of two RDDs into a new RDD. You can find the intersection with this code: rdd1.intersection(rdd2).
     
  • Distinct - The distinct() method is used to get the distinct elements of the source RDD.
     
  • ReduceByKey - The reduceByKey() method aggregates the data separately for each key in the RDD, merging the values of each key with the supplied function. This method can easily be used for a word count in Apache Spark. The reduceByKey() operation is lazily evaluated. Here is example code:
    val words = Array("one","Sixteen","two","nine","five",
      "nine","Sixteen","four","nine","Sixteen","four")
    val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
    data.foreach(println)
     
  • GroupByKey - The groupByKey() function of RDD groups all the values in the data set by the key.
     
  • Join - The join() operation in Spark is used to combine data from two RDDs into one RDD on the basis of a key.
     
  • Cartesian - The cartesian() function is used to get the Cartesian product of two RDD into another RDD.
     
  • Repartition - The Apache Spark framework splits the data into a number of partitions and then executes computations in parallel over the distributed cluster. The repartition() method can be used to increase or decrease the number of partitions for better application performance. You call repartition() on the RDD and pass the number of partitions as a parameter, e.g. val rdd2 = rdd1.repartition(5).
     
  • Coalesce - The coalesce() function is used to reduce the number of partitions of an RDD without a full shuffle of data. So, if we only need to reduce the number of partitions, we should use coalesce() to avoid a full re-shuffle of data.
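
Here is a small sketch that chains several of the transformations above, again assuming a SparkSession named spark:

    // Two small RDDs built from local collections
    val rdd1 = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 4, 3))
    val rdd2 = spark.sparkContext.parallelize(Seq(4, 5, 6, 7))

    val evens  = rdd1.filter(_ % 2 == 0)                       // keep only even numbers
    val merged = rdd1.union(rdd2).distinct()                   // merge both RDDs and drop duplicates
    val common = rdd1.intersection(rdd2)                       // elements present in both RDDs
    val sums   = rdd1.map(n => (n % 2, n)).reduceByKey(_ + _)  // sum of values per key
    val fewer  = merged.coalesce(2)                            // fewer partitions, no full shuffle

    println(fewer.getNumPartitions)                            // prints 2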

RDD supports the following actions (a combined sketch follows the list):

  • count - This function is used to count the number of elements in the RDD.
     
  • collect - The collect() method is used to bring the entire RDD to the driver program. If the data size is huge then it is advisable to use it with care, because it retrieves all the RDD data onto the driver node.
     
  • take - The take() function returns the first n elements of the RDD to the driver machine. If you want to return 10 elements then use take(10).
     
  • top - The top() method returns top n rows from RDD using the default ordering.
     
  • countByValue -  The countByValue() method is used to get the count of unique values in RDD as local Map.
     
  • reduce - The reduce() method in Spark RDD is an action which performs an aggregation over the RDD. Calling reduce() triggers execution of the DAG, and the operation is performed on the full RDD over the distributed cluster.
     
  • fold - The fold() function is like reduce() but with a small difference: reduce() throws an exception on an empty collection, while fold() does not because it takes an initial (zero) value.
     
  • aggregate - The aggregate() function is used when the return type is different from the element type of the RDD, which gives developers extra flexibility.
     
  • foreach - The foreach() function is used to execute an operation on each data item in the RDD.
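
And a combined sketch of the actions listed above (again assuming a SparkSession named spark):

    val numbers = spark.sparkContext.parallelize(Seq(5, 1, 4, 2, 4, 3))

    println(numbers.count())           // 6 - total number of elements
    println(numbers.take(3).toList)    // first 3 elements, returned to the driver
    println(numbers.top(2).toList)     // 2 largest elements using the default ordering
    println(numbers.countByValue())    // Map of each distinct value to its count
    println(numbers.reduce(_ + _))     // 19 - sum of all elements
    println(numbers.fold(0)(_ + _))    // 19 - like reduce(), but with a zero value
    numbers.collect().foreach(println) // brings the whole RDD to the driver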


As we can see from the details above, the Spark RDD is very powerful and provides many ways to process data efficiently on the Spark cluster.

Dataframe

The DataFrame is the second data structure added to the Apache Spark framework and it is a columnar data structure. In a DataFrame the data is organized into columns and rows. It runs on the SparkSQL context and provides SQL operations.

The DataFrame is not typed and its main disadvantage is the lack of type safety. DataFrames support operations like printSchema(), select(), filter(), groupBy() etc. In future lessons you will find many examples of DataFrames in the Spark framework.
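
Here is a minimal DataFrame sketch; the column names are only illustrative and a SparkSession named spark is assumed:

    import spark.implicits._

    // Build a small DataFrame from a local collection of tuples
    val df = Seq(("Alice", 34), ("Bob", 45), ("Alice", 29)).toDF("name", "age")

    df.printSchema()                    // column names and types
    df.select("name").show()            // column projection
    df.filter($"age" > 30).show()       // row filtering with a column expression
    df.groupBy("name").count().show()   // SQL-like aggregation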

Dataset

The Dataset is just like the DataFrame but with full type safety. In a Dataset the data is strongly typed, which solves the type-safety issue faced with DataFrames. The Dataset also runs in the SparkSQL context and provides SQL query features.
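
A minimal Dataset sketch showing the added type safety; the Person case class is only an illustrative example and should live at the top level of a compiled application:

    import spark.implicits._

    // Illustrative schema for the Dataset
    case class Person(name: String, age: Int)

    // A Dataset[Person] is checked at compile time, unlike an untyped DataFrame
    val people = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

    // Fields are accessed as ordinary Scala members, so a typo here fails at compile time
    people.filter(_.age > 40).show()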

TUNGSTEN

The Tungsten data structure was introduced in Apache Spark 2.x (SparkSQL). It is designed to bring the performance of Spark applications closer to bare-metal performance. Project Tungsten works towards improving the efficiency of memory and CPU usage for applications developed in Spark. The Tungsten initiative focuses on the following three areas:

  1. Memory management and binary processing, to overcome the overhead of the JVM.
  2. Cache-aware computation: algorithms and data structures designed for fast processing.
  3. Code generation: runtime code generation to take full advantage of modern compilers and CPUs.

In this way Project Tungsten is working towards a faster processing engine for Apache Spark.
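
Tungsten works under the hood, so there is no separate API to call, but its whole-stage code generation can be seen in a query plan. As a rough sketch (the exact plan text varies between Spark versions), the operators prefixed with an asterisk in the output of explain() are the ones compiled by the code generator:

    import spark.implicits._

    val df = Seq(1, 2, 3, 4).toDF("value").filter($"value" > 2)

    // Operators handled by whole-stage code generation are typically
    // shown with a "*" (WholeStageCodegen) prefix in the physical plan
    df.explain()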

Graphframe

GraphFrames is an Apache Spark library for processing large-scale graph data on a distributed Spark cluster. The API is available for the Java, Scala and Python programming languages. Like the GraphX module of the Apache Spark framework, it makes it possible to run graph queries on large data sets. Data from a source can be loaded into a GraphFrame and then various queries can be performed on the graph data set. Since the processing is done in memory and over the distributed cluster, it provides very fast performance.

In Spark, a GraphFrame can be initialized by passing the vertices and edges data to the GraphFrame() constructor, as shown below:

val gf = GraphFrame(vertices, edges)

Once the GraphFrame is initialized you can run various graph queries on the data set.
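
Here is a slightly fuller sketch in Scala, assuming the external graphframes package has been added to the application (it ships separately from Spark). The vertex DataFrame must contain an id column and the edge DataFrame must contain src and dst columns; all the data below is just illustrative:

    import org.graphframes.GraphFrame
    import spark.implicits._

    // Vertices need an "id" column, edges need "src" and "dst" columns
    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Charlie")).toDF("id", "name")
    val edges = Seq(("a", "b", "friend"), ("b", "c", "follow")).toDF("src", "dst", "relationship")

    val gf = GraphFrame(vertices, edges)

    gf.inDegrees.show()                                // incoming edge count per vertex
    gf.edges.filter("relationship = 'friend'").show()  // query the edge DataFrame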

In this section we have learned about all the data structures provided by the Apache Spark framework. This lesson presented a detailed introduction to the data structures in Apache Spark and a brief comparison between them.

Check Tutorials at: Apache Spark Framework programming tutorial.