Getting Started with Apache Spark on Ubuntu 22.04 - Running a PySpark Program on Spark
Apache Spark is a very powerful and popular in-memory data processing engine used in Big Data environments. Apache Spark is used both in the cloud and on-premise to process data at large scale. Spark is an in-memory computing engine that processes data at very high speed and can be 10, 100, or even 1000 times faster than Hadoop MapReduce processing.
Apache Spark is very popular, and it is important for data engineering professionals to learn it and gain enough experience in large-scale data processing. In this section we will show you how to get started with Apache Spark on the Ubuntu 22.04 operating system. For this tutorial you can easily install Ubuntu 22.04 in Oracle VirtualBox running on your Windows operating system. In future sections you will be able to use your instance of Apache Spark installed on Ubuntu 22.04. Check this tutorial for installing Ubuntu: How to install Ubuntu 22.04 LTS on Oracle Virtualbox?
We have also recorded a video for this tutorial, which will help you learn and complete the example quickly.
Here is the video tutorial: "Installing Apache Spark in Ubuntu 22.04 - Apache Spark on Ubuntu 22.04 for Development".
Installing Apache Spark in Ubuntu 22.04
Step 1: Install JDK/Java
First of all, you should install Java/JDK 8, 11, or 17 on your Ubuntu 22.04 desktop. After installing the JDK you can proceed with the next step of installing Apache Spark.
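If Java is not installed yet, it can be installed from the standard Ubuntu repositories. The commands below are a minimal sketch assuming you choose OpenJDK 11; OpenJDK 8 or 17 work as well (adjust the package name accordingly):

# Install OpenJDK 11 from the Ubuntu repositories and verify the installation
sudo apt update
sudo apt install -y openjdk-11-jdk
java -version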
Step 2: Download the Latest Version of Apache Spark
Apache Spark can be downloaded from the official website at https://spark.apache.org/. Visit this website to download the latest version of Apache Spark. At the time of writing this tutorial the latest version of Apache Spark was 3.5.1, and I downloaded the spark-3.5.1-bin-hadoop3.tgz file on my Ubuntu 22.04 operating system.
After visiting the Apache Spark website, navigate to the Downloads section, where you will find the option to download the latest Spark for your operating system. Here is a screenshot of the download page:
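If you prefer the terminal, the release archive can also be fetched with wget. The command below assumes the dlcdn.apache.org mirror used for the 3.5.1 release at the time of writing; mirrors and versions change, so copy the exact link shown on the download page:

# Download the Spark 3.5.1 archive built for Hadoop 3
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz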
Step 3: Unzip spark-3.5.1-bin-hadoop3.tgz
Next we have to extract the spark-3.5.1-bin-hadoop3.tgz file, for example using the Ubuntu Archive Manager.
You can also use the following command in the terminal to extract the spark-3.5.1-bin-hadoop3.tgz file:
tar -xzvf spark-3.5.1-bin-hadoop3.tgz
Here is the screenshot of the above command:
If you are using the terminal in Ubuntu, the above command can be used to extract the .tgz file.
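Optionally, you can point SPARK_HOME at the extracted directory and add its bin folder to the PATH so that pyspark can be started from any location. This is a minimal sketch assuming the archive was extracted under ~/test, as in the output shown later; adjust the path to your own setup:

# Make the Spark binaries available in every new terminal session
echo 'export SPARK_HOME=$HOME/test/spark-3.5.1-bin-hadoop3' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc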
Step 4: Run the pyspark terminal
Go to the bin directory of Apache Spark and then run the ./pyspark command:
./pyspark
You will get the following output:
user@user-VirtualBox:~/test/spark-3.5.1-bin-hadoop3/bin$ ./pyspark
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
24/03/28 17:41:33 WARN Utils: Your hostname, user-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/03/28 17:41:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/28 17:41:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.10.12 (main, Nov 20 2023 15:14:05)
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1711627895299).
SparkSession available as 'spark'.
Screenshot of the above step:
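Before moving on, you can confirm that the shell has pre-created the 'sc' and 'spark' objects mentioned in the startup message. A quick check, run inside the pyspark prompt:

print(sc.master)      # prints the master URL, local[*] in this local setup
print(spark.version)  # prints the Spark version, 3.5.1 here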
Step 5: Run a Hello World program in the pyspark terminal
Now run the following code in the pyspark terminal:
from pyspark import SparkContext
from operator import add

data = sc.parallelize(list("Hello World"))
counts = (data.map(lambda x: (x, 1))
              .reduceByKey(add)
              .sortBy(lambda x: x[1], ascending=False)
              .collect())
for (word, count) in counts:
    print("{}: {}".format(word, count))
Here is the output of the above code:
Step 6: Create a DataFrame from a list in PySpark
Now we will create a list in Python and then use it to create a DataFrame. Here is the code:
from pyspark.sql.types import IntegerType

# Create list
oneToTen = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Convert to DataFrame
df = spark.createDataFrame(oneToTen, IntegerType())

# Display data
df.show()

# Display record count
df.count()
Here is the output of the above code:
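As a small follow-up, the same createDataFrame call also accepts a list of tuples together with column names, which is often more convenient than a single-column DataFrame. The sketch below uses made-up sample data and column names purely for illustration:

# Sample rows as (name, age) tuples -- illustrative data only
people = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]

# Column names are supplied as a simple list of strings
df2 = spark.createDataFrame(people, ["name", "age"])

# Display the rows, then only the rows where age is greater than 30
df2.show()
df2.filter(df2.age > 30).show()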
Step 7: View jobs in the Spark UI
Open a browser and navigate to http://localhost:4040/. In this UI you will be able to see the execution details of your jobs, and it can be used to monitor their execution. Here is one of the screenshots:
In this tutorial we have learned how to download and install Apache Spark and develop some programs on it. This tutorial is a beginner's guide to getting started with Apache Spark.