Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Pandas API on Spark creates huge query plans

varshanagarajan
New Contributor

Hello,

I have a piece of code written both in PySpark and in the Pandas API on Spark. Comparing the query plans, I see that the Pandas API on Spark creates a huge query plan, whereas the PySpark plan is tiny. Furthermore, with the Pandas API on Spark we see a lot of inconsistencies in the generated results. The PySpark code executes in 4 minutes, while the Pandas API on Spark takes 16 minutes. Are there known reasons or documented issues for this? I would like to understand why this is happening so that I can address the problem.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @varshanagarajan,

Yes, the differences in query plans and performance between PySpark and the Pandas API on Spark can be explained by the underlying architecture and implementation details.

The Pandas API on Spark provides a set of APIs that let pandas users scale their pandas code to distributed data in a Spark cluster. There are several reasons for the performance gap you are seeing between PySpark and the Pandas API on Spark, and for the inconsistencies in the results.

  1. Query Optimization: PySpark has a sophisticated query optimizer (Catalyst) that optimizes queries based on the underlying data structures and the available resources. The Pandas API on Spark is built on top of the PySpark API and generally relies on the same Catalyst optimizer. However, emulating pandas semantics on large datasets can produce much larger, suboptimal query plans, and the optimizer may struggle to apply the set of transformations needed for good performance (see the plan-comparison sketch after this list).

  2. Data Distribution: Spark is built for distributed computing and is designed to handle large datasets spread across multiple worker nodes. Pandas, on the other hand, is designed to run on a single machine. With the Pandas API on Spark, some operations need to reorganize or realign data across the cluster to preserve pandas semantics (for example, index and row-order behavior) before redistributing it. This can introduce shuffling and extra overhead, and it hurts performance, especially on larger datasets.

  3. Memory Usage: The Pandas API on Spark often materializes extra intermediate data (for example, to maintain a pandas-style index) that consumes more memory than the equivalent PySpark operation. That can lead to large memory allocations, additional garbage-collection pauses, and disk spilling, all of which slow the query down.

  4. API Compatibility: The Pandas API on Spark tries to emulate the pandas API as closely as possible, but there are differences and limitations; for example, distributed execution does not guarantee the same row ordering as single-machine pandas, which can lead to unexpected or inconsistent results.
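A quick way to see the difference is to print the two plans side by side. The sketch below is a minimal, self-contained illustration (the toy data, column names, and the groupby step are assumptions for demonstration, not your original code): the PySpark plan comes from DataFrame.explain(), and the pandas-on-Spark plan from the spark.explain() accessor on the equivalent frame.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real table (hypothetical columns).
sdf = spark.range(1_000_000).withColumn("value", F.rand(seed=42))

# Plan generated by plain PySpark for a simple aggregation.
(sdf.groupBy((F.col("id") % 10).alias("bucket"))
    .agg(F.avg("value").alias("avg_value"))
    .explain())

# Equivalent Pandas-API-on-Spark version; explain() on the underlying
# Spark frame typically shows a much larger plan for the same logic.
psdf = sdf.pandas_api()
psdf["bucket"] = psdf["id"] % 10
psdf.groupby("bucket")[["value"]].mean().spark.explain()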

To address the issues, you can try the following:

  1. Optimize Data Distribution: Avoid using the Pandas API on Spark for operations that force large shuffles or transfer large amounts of data between nodes. Keep the data spread across multiple partitions and nodes, and minimize wide (shuffle-producing) operations.

  2. Optimize the Query Plan: Where the plan generated by the Pandas API on Spark is suboptimal, tweak the query: review the execution plan, apply filters as early as possible in the pipeline, repartition the dataset, or rearrange the sequence of operations (a short sketch follows this list).

  3. Memory and Garbage Collection: If you are encountering memory-related performance issues with Pandas API on Spark, you can try tuning the memory configuration of your Spark applications. You can use the Spark UI to monitor memory usage and look for potential memory leaks or large object allocations.

  4. API Compatibility: Be aware of the API's limitations and use the available functions in a way that fits the underlying distributed engine. For example, avoid row-by-row operations and prefer vectorized column operations to minimize the overhead of the Python interpreter.
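To make these remedies concrete, here is a minimal sketch under assumed names (the "sales" table and its amount, discount, and region columns are hypothetical): it pushes a filter down early, repartitions before wide operations, prefers vectorized arithmetic over a row-by-row apply, and drops down to plain PySpark for a heavy aggregation.

import pyspark.pandas as ps
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and columns, purely for illustration.
psdf = ps.read_table("sales")

# (1)/(2) Filter early and control partitioning before wide operations
# so less data is shuffled.
psdf = psdf[psdf["amount"] > 0]
psdf = psdf.spark.repartition(200)

# (4) Prefer vectorized column arithmetic; the commented apply() variant
# forces slow row-by-row execution in the Python interpreter.
psdf["net"] = psdf["amount"] - psdf["discount"]
# psdf["net"] = psdf.apply(lambda r: r["amount"] - r["discount"], axis=1)

# If a heavy step is easier to optimize in plain PySpark, convert,
# run it there, and only come back to the pandas API if needed.
sdf = psdf.to_spark()
totals = sdf.groupBy("region").agg(F.sum("net").alias("total_net"))
totals.explain()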

In general, the Pandas API on Spark should not be used as a blanket replacement for PySpark code. Rather, use it selectively, in the cases where pandas-style functionality (such as its groupby and indexing semantics) makes the code more readable and easier to maintain.
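One common pattern for that selective use is to keep the heavy transformations in PySpark and switch to the pandas API only where its syntax is genuinely more convenient, converting between the two representations explicitly. A minimal sketch, with hypothetical table and column names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Heavy lifting in plain PySpark (hypothetical table/columns).
sdf = (spark.table("events")
       .where(F.col("event_type") == "purchase")
       .groupBy("user_id")
       .agg(F.count("*").alias("purchases")))

# Switch to the pandas API only for pandas-style post-processing.
psdf = sdf.pandas_api()
top_users = psdf.sort_values("purchases", ascending=False).head(100)

# Convert back if the result feeds further Spark processing.
top_users_sdf = top_users.to_spark()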
