Exploring the Apache Spark 3.0.0 Features
This article discusses the features and improvements in release 3.0.0 of the highly popular Apache Spark framework. Apache Spark 3.0.0 is the first release in the 3.x line, which is intended to be supported over the long term.
The Apache Software Foundation has released the first version of the 3.x line, and it comes with groundbreaking features. The first release of the Spark 3.x line is Apache Spark 3.0.0, released on June 18, 2020; the corresponding build is tagged v3.0.0 on GitHub, with the tag committed on June 10, 2020. The Spark 3.x line is a long-term project intended for production deployment. This version of the framework carries forward many of the innovations of Spark 2.x and adds many new features. The release closes more than 3,400 tickets, with contributions from over 440 developers around the world. Apache Spark 3.x is the most advanced version of the framework, ready to power today's industry by processing vast collections of data at high speed.
The Apache Spark framework was first released in 2010, which makes 2020 its 10-year anniversary. In the past decade, Apache Spark has grown into one of the most active projects in the Big Data industry.
Apache Spark is a highly popular framework for processing data in Big Data environments. It is used for in-memory data processing, data science, machine learning, and data analytics.
Spark SQL is the most active component of the framework: over 46% of the resolved tickets relate to Spark SQL. Apache Spark 3.0.0 comes with enhancements to the higher-level libraries and APIs, including MLlib, Spark SQL, and DataFrames. According to the TPC-DS 30 TB benchmark, Apache Spark 3.0 is roughly two times faster than Spark 2.4. Spark 3.0.0 therefore also brings performance improvements, which matter when processing large collections of data.
Apache Spark 3.x is a major release of the already popular framework. Support for Python 2.x is heavily deprecated and is being dropped: Spark 3.x needs Python 3 to run programs written in Python, so if you plan to use Spark 3.x in your production environment, you have to use Python 3.x. In this release the Scala version is also upgraded to 2.12, and JDK 11 is fully supported.
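To make these runtime requirements concrete, here is a minimal sketch, assuming the pyspark 3.0.0 package is installed, that verifies the interpreter is Python 3 and prints the Spark version of a local session:

```python
# Minimal environment check for Spark 3.x; assumes `pip install pyspark==3.0.0`.
import sys

from pyspark.sql import SparkSession

# Spark 3.x programs written in Python must run under Python 3.
assert sys.version_info.major >= 3, "Spark 3.x requires Python 3"

spark = (
    SparkSession.builder
    .appName("spark3-environment-check")
    .master("local[*]")          # local mode, just for the check
    .getOrCreate()
)

print("Python version:", sys.version.split()[0])
print("Spark version :", spark.version)   # expected: 3.0.0

spark.stop()
```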
Apache Spark 3.x comes with many new features and a more optimized Spark SQL. This version of the framework will help both data analysts and data scientists develop better solutions for industry. Here is a list of features in the Apache Spark 3.x line:
- Spark Graph: Cypher script and property graph
- Python 3, Scala 2.12, and JDK 11
- Deep learning: GPU support
- Log loss support
- Binary file data source (see the sketch after this list)
- Kubernetes
- Koalas: pandas at Spark scale (see the sketch after this list)
- Kafka streaming: record headers included (see the sketch after this list)
- YARN features
- Analyze cached data
- Dynamic partition pruning (see the sketch after this list)
- Delta Lake
- Decision trees in Spark ML
- Improved optimizer during query execution
- Pluggable catalog integration
- Metrics for executor memory
- Dynamic allocation
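For the binary file data source, the following is a minimal PySpark sketch; the directory path and the glob filter are hypothetical placeholders:

```python
# Minimal sketch of the `binaryFile` data source introduced in Spark 3.0.
# The path "/data/images" is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binary-file-example").getOrCreate()

binary_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.png")   # optional: only pick up PNG files
    .load("/data/images")
)

# Each row carries the file path, modification time, length, and raw content.
binary_df.select("path", "modificationTime", "length").show(truncate=False)
```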
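For Koalas, a minimal sketch is shown below, assuming the separate koalas package has been installed alongside PySpark (pip install koalas):

```python
# Minimal Koalas sketch; assumes `pip install koalas` in addition to pyspark.
import databricks.koalas as ks
import pandas as pd

# A small pandas DataFrame, converted to a Koalas DataFrame that is
# backed by Spark and can scale to much larger data.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
kdf = ks.from_pandas(pdf)

# Familiar pandas-style operations run on Spark under the hood.
print(kdf.describe())
print(kdf[kdf["value"] > 15].head())
```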
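For the Kafka header support, here is a minimal Structured Streaming sketch; the broker address localhost:9092, the topic name "events", and the presence of the spark-sql-kafka-0-10 package on the classpath are all assumptions for illustration:

```python
# Minimal sketch of reading Kafka record headers in Spark 3.0 Structured
# Streaming. Assumes the spark-sql-kafka-0-10 package is on the classpath,
# a broker at localhost:9092, and a topic named "events" (all hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-headers-sketch").getOrCreate()

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("includeHeaders", "true")   # new in Spark 3.0
    .load()
)

# Each record now exposes a `headers` column alongside the key and value.
query = (
    stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```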
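For dynamic partition pruning, the sketch below illustrates the kind of query that benefits from it; the sales and dates tables and their data are purely illustrative, and the configuration flag only makes the default setting explicit:

```python
# Minimal sketch of a query shape that benefits from dynamic partition
# pruning in Spark 3.0. Table names and data are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-sketch").getOrCreate()

# Dynamic partition pruning is enabled by default in 3.0; this just makes it explicit.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# A small partitioned fact table and a dimension table, written to the
# default warehouse purely for illustration.
spark.range(0, 1000).selectExpr("id", "id % 10 AS date_id") \
    .write.partitionBy("date_id").mode("overwrite").saveAsTable("sales")
spark.range(0, 10).selectExpr("id AS date_id", "2010 + id AS year") \
    .write.mode("overwrite").saveAsTable("dates")

# Only the fact-table partitions matching the dimension filter need to be scanned.
result = spark.sql("""
    SELECT s.id, s.date_id
    FROM sales s
    JOIN dates d ON s.date_id = d.date_id
    WHERE d.year = 2015
""")
result.explain()   # look for a dynamic pruning expression in the physical plan
```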