How to Create SparkSession in PySpark?

In this tutorial you will learn how to create a SparkSession in PySpark.

First of all, you should learn how to create a SparkSession object in your PySpark program. In the last session we installed the PyCharm IDE and set it up for PySpark coding. In this section I will show you how to create a SparkSession object and then use it to write a few PySpark code examples.

In this tutorial we are going to understand the SparkSession API of PySpark and explore it with example code. The SparkSession is the entry point for running Spark code on an Apache Spark cluster, so as a data engineer you should understand this API and learn to use the features of SparkSession when developing your programs.

The Python library for Apache Spark, known as PySpark, is very popular among data engineers, data analysts and data science professionals for developing Spark programs. PySpark gained popularity due to its simplicity and ease of use when developing Apache Spark programs. It is also quick to learn, so you can start developing Apache Spark programs for data processing and machine learning with little effort.

What is SparkSession?

The SparkSession is the entry point for executing a Spark program on an Apache Spark cluster. This API provides the functions to work with the Dataset and DataFrame APIs. The SparkSession object is used to create DataFrames, register DataFrames as tables, run SQL queries over tables, work with Parquet files and perform various other operations on data sets.

To create a SparkSession object, the builder pattern is used.

builder: This is a class attribute that provides a Builder object, which is used to construct SparkSession instances.
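
Here is a minimal sketch of creating a SparkSession with the builder pattern; the application name "PySparkExample" and the local master URL are just example values, so adjust them for your own cluster:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; appName and master are example values
spark = SparkSession.builder \
    .appName("PySparkExample") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)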

Methods

The SparkSession object provides many methods to work with the Apache Spark cluster. Now we will look at the important methods of SparkSession. Here are the methods with their descriptions:

createDataFrame(data[, schema, ...])

The createDataFrame() method is used to create a DataFrame from an RDD, a list or a pandas.DataFrame. So, you can use existing Python-supported objects to create a DataFrame.
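
For example, here is a small sketch that builds a DataFrame from a plain Python list; the column names "name" and "age" and the data values are only illustrative:

# Create a DataFrame from a list of tuples with column names as the schema
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, schema=["name", "age"])
df.show()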

getActiveSession()

The getActiveSession() method returns the active SparkSession for the current thread, as returned by the builder.
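
A quick sketch, assuming Spark 3.0 or later where getActiveSession() is available:

from pyspark.sql import SparkSession

# Returns the SparkSession active in the current thread, or None if there is none
active = SparkSession.getActiveSession()
print(active is spark)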

newSession()

If you have a requirement to create a new session, then you can use the newSession() method. The new session has a separate SQLConf, registered temporary views and UDFs, but it shares the SparkContext and table cache with the original session.
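
For example, the following sketch creates an isolated session that still shares the same SparkContext:

# The new session has its own SQLConf and temporary views,
# but shares the SparkContext with the original session
other = spark.newSession()
print(other.sparkContext is spark.sparkContext)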

range(start[, end, step, numPartitions])

The range(start[, end, step, numPartitions]) method is used to create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with the given step value.
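
A short sketch of range():

# DataFrame with a single LongType column named "id": 0, 2, 4, 6, 8
numbers = spark.range(0, 10, 2)
numbers.show()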

sql(sqlQuery)

The sql(sqlQuery) method is used to run the given SQL query using Spark SQL and returns a DataFrame representing the result of the query.
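
The sketch below assumes the DataFrame df created in the createDataFrame() example above and registers it as a temporary view named "people" before querying it:

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 40")
result.show()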

stop()

The stop() method stops the underlying SparkContext.
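
For example, at the end of your program:

# Shut down the underlying SparkContext and release cluster resources
spark.stop()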

table(tableName)

This method returns the specified table as a DataFrame, and it is useful when you need to get table data as a DataFrame in your PySpark program.
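
For example, assuming a table or temporary view named "people" exists (as registered above):

# Load an existing table or temporary view as a DataFrame
people_df = spark.table("people")
people_df.show()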

Attributes

builder

The builder is a class attribute that provides a Builder to construct SparkSession instances. Earlier in this tutorial we have already shown an example of the builder attribute.

catalog

The catalog attribute is an interface through which developers can create, drop, alter or query the underlying databases, tables, functions, etc.
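
A small sketch of the catalog interface:

# List the databases and tables known to the current session
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())

# Drop a temporary view created earlier (returns True if it existed)
spark.catalog.dropTempView("people")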

conf

The conf attribute is the runtime configuration interface for Spark.
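
For example, reading and changing a runtime setting; the shuffle partitions value below is only illustrative:

# Set and read back a runtime configuration value
spark.conf.set("spark.sql.shuffle.partitions", "50")
print(spark.conf.get("spark.sql.shuffle.partitions"))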

read

The read attribute returns a DataFrameReader that is used to read data in as a DataFrame.
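
A minimal sketch; the file paths below are hypothetical placeholders:

# spark.read returns a DataFrameReader; choose the format and load the data
csv_df = spark.read.option("header", "true").csv("data/people.csv")
json_df = spark.read.json("data/events.json")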

readStream

The readStream attribute is used to read data streams as a streaming DataFrame.
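
A minimal sketch that reads a text stream from a socket; the host and port are example values:

# spark.readStream returns a DataStreamReader for building a streaming DataFrame
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
print(lines.isStreaming)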

sparkContext

The sparkContext attribute is used to get the underlying SparkContext.
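
For example, the underlying SparkContext can be used to work with RDDs directly:

# Get the underlying SparkContext and create an RDD from a Python list
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.count())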

streams

The streams attribute returns a StreamingQueryManager. The StreamingQueryManager is very useful when working with streaming data in Apache Spark.
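
A short sketch that inspects the currently active streaming queries:

# spark.streams is a StreamingQueryManager; list the active streaming queries
for query in spark.streams.active:
    print(query.name, query.isActive)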

udf

The udf attribute returns a UDFRegistration object for registering user-defined functions (UDFs).
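
For example, registering a simple Python function so it can be called from SQL; the function name "to_upper" is just an example:

from pyspark.sql.types import StringType

# Register a Python function as a SQL UDF and call it from a query
spark.udf.register("to_upper", lambda s: s.upper() if s else None, StringType())
spark.sql("SELECT to_upper('hello') AS upper_value").show()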

version

With the help of the version attribute you can find out the version of Spark on which the application is running.
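
For example:

# Print the Spark version the session is running on
print(spark.version)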
