Getting Started with Apache Spark on Ubuntu 22.04 - Running a PySpark Program on Spark
Apache Spark is a very powerful and popular in-memory data processing engine used in Big Data environments. Apache Spark is used both in the cloud and on-premise to process data at large scale. Spark is an in-memory computing engine that processes data at very high speed and can be 10, 100, or even 1000 times faster than Hadoop MapReduce processing.
Apache Spark is very popular, and it is important for data engineering professionals to learn it and gain enough experience in large-scale data processing. In this section we will show you how to get started with Apache Spark on the Ubuntu 22.04 operating system. For this tutorial you can easily install Ubuntu 22.04 in Oracle VirtualBox running on your Windows operating system. In future sections you will be able to use your instance of Apache Spark installed on Ubuntu 22.04. Check this tutorial for installing Ubuntu: How to install Ubuntu 22.04 LTS on Oracle Virtualbox?
We have also recorded a video for this tutorial, which will help you learn and complete the example quickly.
Here is the video tutorial: "Installing Apache Spark in Ubuntu 22.04 - Apache Spark on Ubuntu 22.04 for Development".
Installing Apache Spark in Ubuntu 22.04
Step 1: Install JDK/Java
First of all, you should install Java/JDK 8, 11, or 17 on your Ubuntu 22.04 desktop. After installing the JDK you can proceed with the next step of installing Apache Spark.
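If Java is not installed yet, it can be installed from the standard Ubuntu repositories. The commands below are a minimal sketch assuming you choose OpenJDK 11; OpenJDK 8 or 17 work as well (adjust the package name accordingly):

# Install OpenJDK 11 from the Ubuntu repositories and verify the installation
sudo apt update
sudo apt install -y openjdk-11-jdk
java -version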
Step 2: Download the Latest Version of Apache Spark
Apache Spark can be downloaded from the official website at https://spark.apache.org/. Visit this website to download the latest version of Apache Spark. At the time of writing this tutorial the latest version of Apache Spark was 3.5.1, and I downloaded the spark-3.5.1-bin-hadoop3.tgz file on my Ubuntu 22.04 operating system.
After visiting the Apache Spark website, navigate to the Downloads section, where you will find the option to download the latest Spark for your operating system. Here is a screenshot of the download page:
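If you prefer the terminal, the release archive can also be fetched with wget. The command below assumes the dlcdn.apache.org mirror used for the 3.5.1 release at the time of writing; mirrors and versions change, so copy the exact link shown on the download page:

# Download the Spark 3.5.1 archive built for Hadoop 3
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz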
Step 3: Unzip spark-3.5.1-bin-hadoop3.tgz
Next we have to extract the spark-3.5.1-bin-hadoop3.tgz file, for example using the Ubuntu Archive Manager.
You can also use the following command in the terminal to extract the spark-3.5.1-bin-hadoop3.tgz file:
tar -xzvf spark-3.5.1-bin-hadoop3.tgz
Here is the screenshot of the above command:
If you are using the terminal in Ubuntu, the above command can be used to extract the .tgz file.
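Optionally, you can point SPARK_HOME at the extracted directory and add its bin folder to the PATH so that pyspark can be started from any location. This is a minimal sketch assuming the archive was extracted under ~/test, as in the output shown later; adjust the path to your own setup:

# Make the Spark binaries available in every new terminal session
echo 'export SPARK_HOME=$HOME/test/spark-3.5.1-bin-hadoop3' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc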
Step 4: Run the pyspark terminal
Go to the bin directory of Apache Spark and then run the ./pyspark command:
./pyspark
You will get the following output:
user@user-VirtualBox:~/test/spark-3.5.1-bin-hadoop3/bin$ ./pyspark
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
24/03/28 17:41:33 WARN Utils: Your hostname, user-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/03/28 17:41:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/28 17:41:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.10.12 (main, Nov 20 2023 15:14:05)
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1711627895299).
SparkSession available as 'spark'.
Screenshot of the above step:
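Before moving on, you can confirm that the shell has pre-created the 'sc' and 'spark' objects mentioned in the startup message. A quick check, run inside the pyspark prompt:

print(sc.master)      # prints the master URL, local[*] in this local setup
print(spark.version)  # prints the Spark version, 3.5.1 here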
Step 5: Run a Hello World program in the pyspark terminal
Now run the following code in the pyspark terminal:
from pyspark import SparkContext
from operator import add

data = sc.parallelize(list("Hello World"))
counts = (data.map(lambda x: (x, 1))
              .reduceByKey(add)
              .sortBy(lambda x: x[1], ascending=False)
              .collect())
for (word, count) in counts:
    print("{}: {}".format(word, count))
Here is the output of the above code:
Step 6: Create a DataFrame from a list in PySpark
Now we will create a list in Python and then use it to create a DataFrame. Here is the code:
from pyspark.sql.types import IntegerType

# Create list
oneToTen = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Convert to DataFrame
df = spark.createDataFrame(oneToTen, IntegerType())

# Display data
df.show()

# Display record count
df.count()
Here is the output of the above code:
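As a small follow-up, the same createDataFrame call also accepts a list of tuples together with column names, which is often more convenient than a single-column DataFrame. The sketch below uses made-up sample data and column names purely for illustration:

# Sample rows as (name, age) tuples -- illustrative data only
people = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]

# Column names are supplied as a simple list of strings
df2 = spark.createDataFrame(people, ["name", "age"])

# Display the rows, then only the rows where age is greater than 30
df2.show()
df2.filter(df2.age > 30).show()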
Step 7: View jobs in the Spark UI
Open a browser and navigate to http://localhost:4040/. In this UI you will be able to see the execution details of your jobs, and it can be used to monitor their execution. Here is one of the screenshots:
In this tutorial we have learned how to download and install Apache Spark and develop some programs on it. This tutorial is a beginner's guide to getting started with Apache Spark.