Read text file in PySpark - How to read a text file in PySpark?
PySpark is a very powerful API that provides functionality to read files into an RDD and perform various operations on them. This is a very simple tutorial that reads a text file and collects the data into an RDD.
The term RDD stands for Resilient Distributed Dataset. An RDD uses the RAM on the nodes of the Spark cluster to store data, and any computation on an RDD is executed on the worker nodes of the cluster.
In this way an RDD can be used to process a large amount of data in memory across a distributed cluster, and the processed data can then be fetched back to the master (driver) node. This architecture makes Spark very powerful for distributed data processing.
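For example, the same distribute-then-collect pattern can be illustrated with a tiny in-memory dataset. The snippet below is only a minimal sketch and assumes a SparkContext named sc has already been created (we will create one later in this tutorial):

numbers = sc.parallelize([1, 2, 3, 4, 5])    # distribute the data across the cluster
squares = numbers.map(lambda x: x * x)       # the computation runs on the worker nodes
print(squares.collect())                     # results are fetched back to the driver: [1, 4, 9, 16, 25]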
Our program will collect the data into a list of lines and then print them on the console.
We will create a text file with the following text:
one two three four five six seven eight nine ten
Create a new file in any directory on your computer and add the above text. In my example I have created the file test1.txt. We will write PySpark code to read the data into an RDD and print it on the console.
So, the first thing is to import the following libraries in "readfile.py":
from pyspark import SparkContext
from pyspark import SparkConf
This will import the required Spark libraries.
Next, create the SparkContext with the following code:
# create Spark context with Spark configuration
conf = SparkConf().setAppName("read text file in pyspark")
sc = SparkContext(conf=conf)
As explained earlier, the SparkContext (sc) is the entry point to the Spark cluster. We will use the sc object to read the file and then collect the data.
Here is the complete program code (readfile.py):
from pyspark import SparkContext
from pyspark import SparkConf

# create Spark context with Spark configuration
conf = SparkConf().setAppName("read text file in pyspark")
sc = SparkContext(conf=conf)

# Read file into RDD
lines = sc.textFile("/home/deepak/test1.txt")

# Call collect() to get all data
llist = lines.collect()

# print line one by line
for line in llist:
    print(line)
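Once the data is in an RDD, you can apply further operations before collecting it. The lines below are a small optional extension of readfile.py, showing a couple of common RDD operations on the same lines RDD:

# Optional: a few more operations on the same RDD
print(lines.count())                                   # number of lines in the file
words = lines.flatMap(lambda line: line.split(" "))    # split each line into individual words
print(words.count())                                   # total number of words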
To run the program, use the spark-submit tool with the following command:
./spark-submit readfile.py
The above command will display the following output:
deepak@deepak-VirtualBox:~/spark/spark-2.3.0-bin-hadoop2.7/bin$ ./spark-submit readfile.py
one two three four five six seven eight nine ten
deepak@deepak-VirtualBox:~/spark/spark-2.3.0-bin-hadoop2.7/bin$
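spark-submit also accepts options such as --master to choose where the job runs. For example, a typical invocation for running the same program locally on all available cores would look like this (adjust the path to readfile.py for your machine):

./spark-submit --master local[*] readfile.py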
In this tutorial we have learned how to read a text file into an RDD and then print the data line by line. We have a large number of Spark tutorials, and you can view all of them at: