Spark Data Structures

In this tutorial we discuss the data structures available in the Apache Spark framework.

To use Apache Spark effectively, developers must first understand the data structures the framework provides so that they can choose the right one for their application's requirements.

Apache Spark is a distributed parallel computing engine with the following characteristics:

  • Spark is an in-memory computation engine for processing large data sets
  • It distributes data and computation across multiple nodes in the cluster
  • It provides a machine learning API (MLlib) that can be used to develop ML/AI applications
  • It provides distributed infrastructure for ML model training and deployment

In this tutorial we explore the various data structures supported by the Spark framework. Programmers can use these data structures while writing Spark applications in any of the following programming languages:

  • Java
  • Scala
  • Python
  • R

Apache Spark is a large-scale parallel processing framework that runs over a distributed Spark cluster. Nodes in the cluster perform computations on large data sets. This fast, in-memory parallel processing engine works best when the right data structures are used, so it is important for developers to understand these data structures well and practice with many examples.

Spark Data Structure

The Apache Spark framework provides the following data structures:

  • RDD
  • DataFrame
  • Dataset
  • Tungsten
  • GraphFrame

We will now discuss these data structures one by one and look at the features of each.

1. RDD

RDD stands for Resilient Distributed Dataset and was introduced in the first version of the Spark framework. An RDD is an immutable data structure that distributes data in partitions across the nodes of the cluster. Computation on the data is performed on the node where the data resides, which keeps the architecture flexible and enables parallel processing. RDDs provide interfaces for performing transformations and actions.

An RDD is immutable: it is created by coarse-grained operations such as map, filter, and groupBy, and once created it cannot be modified. Applying a transformation to an existing RDD produces a new RDD.
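
To make this concrete, here is a minimal PySpark sketch (the application name RDDExample and the sample numbers are made up for this example; a local Spark installation is assumed). It shows that transformations such as map and filter each return a new RDD, while an action such as collect triggers the actual computation:

  from pyspark.sql import SparkSession

  # Create a SparkSession; the low-level SparkContext is available from it
  spark = SparkSession.builder.appName("RDDExample").getOrCreate()
  sc = spark.sparkContext

  # Create an RDD from a local collection; Spark partitions it across nodes
  numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

  # Transformations are lazy and return new RDDs; 'numbers' is never modified
  squares = numbers.map(lambda x: x * x)
  even_squares = squares.filter(lambda x: x % 2 == 0)

  # Actions trigger the computation and return results to the driver
  print(even_squares.collect())  # [4, 16, 36]

  spark.stop()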

Common use cases for RDDs are:

  • Low-level transformations and actions on data sets
  • When the data is unstructured, such as streams of media or text
  • When you want to manipulate your data with functional programming constructs
  • When you can forgo the optimization and performance benefits that DataFrames and Datasets provide

In future tutorials we will provide many examples of working with RDDs in PySpark.

2. DataFrames

Like RDDs, DataFrames are immutable and cannot be modified once created; operations such as map and filter instead produce new DataFrames. A DataFrame is a distributed collection of data organized into named columns, much like the rows and columns of an RDBMS table.

DataFrames run on the Spark SQL context and support SQL-like queries for querying data. DataFrames can be built from many different sources, including Hive tables, structured data files, external databases, and existing RDDs.
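
As an illustration, here is a minimal PySpark sketch (the application name, the view name people, and the sample rows are made up for this example). It builds a DataFrame from an in-memory list and queries it both through the DataFrame API and with SQL:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

  # Create a DataFrame from an in-memory list, with named columns
  df = spark.createDataFrame(
      [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
      ["name", "age"],
  )

  # DataFrame API: each operation returns a new, immutable DataFrame
  df.filter(df.age > 30).select("name").show()

  # Or register the DataFrame as a temporary view and query it with SQL
  df.createOrReplaceTempView("people")
  spark.sql("SELECT name, age FROM people WHERE age > 30").show()

  spark.stop()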

The DataFrame API was designed to meet the requirements of modern Big Data and Data Science applications. It is strongly influenced by the design principles of data frames in R and pandas in Python.

3. Dataset

A Dataset is also a distributed collection of data organized into named columns, but it adds type safety that is checked at compile time. Datasets were developed with the aim of adding type safety to DataFrames. Note that the typed Dataset API is available only in Scala and Java; in Python and R, the untyped DataFrame API fills this role.

4. Tungsten

Tungsten is a component of Spark SQL that provides efficient operations on data sets by working directly at the byte level. Because Datasets are typed, Spark already knows the format of the data they contain; with this information it generates encoders that convert objects to and from the compact Tungsten binary format, allowing operations on that format to run very fast.

Data stored in the Tungsten format takes roughly 4 to 5 times less space and delivers better performance, including better memory utilization.

5. GraphFrame

The GraphFrame data structure is used for storing and processing graph data. A GraphFrame stores the graph in two distinct DataFrames (see the example sketch after this list):

  • One for graph vertices, and
  • One for graph edges
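
Note that GraphFrames ships as a separate Spark package rather than as part of core Spark. The following is a minimal PySpark sketch, assuming the graphframes package is installed and its jars are available to the cluster; the vertex and edge data are made up for the example. The vertices DataFrame must contain an id column, and the edges DataFrame must contain src and dst columns:

  from pyspark.sql import SparkSession
  from graphframes import GraphFrame  # separate package, not part of core Spark

  spark = SparkSession.builder.appName("GraphFrameExample").getOrCreate()

  # Vertices DataFrame: requires an "id" column
  vertices = spark.createDataFrame(
      [("a", "Alice"), ("b", "Bob"), ("c", "Cathy")],
      ["id", "name"],
  )

  # Edges DataFrame: requires "src" and "dst" columns
  edges = spark.createDataFrame(
      [("a", "b", "follows"), ("b", "c", "follows")],
      ["src", "dst", "relationship"],
  )

  g = GraphFrame(vertices, edges)
  g.inDegrees.show()  # number of incoming edges for each vertex

  spark.stop()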

In this tutorial we have learned about the various data structures supported by the Spark framework.
