Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to distribute PySpark DataFrame repartition and row count on Databricks?

etao
New Contributor III

I am trying to compare two large datasets for discrepancies. The datasets come from two database tables, each with around 500 million rows. I use PySpark subtract and joins (leftanti, leftsemi) to sort out the differences (sketched below). To distribute the workload, I need to repartition the two datasets on the join key column. The repartition takes forever and eventually errors out, and the cluster dashboard shows only one executor working even though up to 5 workers are allocated.

Py4JJavaError: An error occurred while calling o8067.javaToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 85.0 failed 4 times, most recent failure: Lost task 0.3 in stage 85.0 (TID 437) (10.201.112.98 executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 133932 ms Driver stacktrace:

Questions:

- How do I distribute a workload such as repartitioning and counting DataFrames?

- Is there a better solution for achieving this goal?
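
For reference, a minimal sketch of the comparison approach described in the question, using leftanti joins on a key column. The table names source_table_a and source_table_b and the column join_key are hypothetical stand-ins:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source tables, each with ~500M rows.
df_a = spark.table("source_table_a")
df_b = spark.table("source_table_b")

# Rows whose join key appears in one table but not the other.
only_in_a = df_a.join(df_b, on="join_key", how="leftanti")
only_in_b = df_b.join(df_a, on="join_key", how="leftanti")

print(only_in_a.count(), only_in_b.count())

If both counts are zero the key sets match; comparing non-key columns would additionally require a subtract or a full-row join.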


2 REPLIES

Kaniz_Fatma
Community Manager

Hi @etao, to distribute the workload effectively, try repartitioning by the join key column or increasing the number of partitions. Use coalesce only when reducing the number of partitions, since it avoids shuffling data. For better performance, consider a broadcast join when one dataset is small, or handle data skew by adding a salt key to balance the partitions. Also, optimize your cluster configuration by adjusting the number of executors and the memory per executor. A rough sketch of these options follows below.
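
A rough sketch of these suggestions, reusing the df_a, df_b, and join_key placeholders from the question; the partition count and salt count are arbitrary illustrative values:

from pyspark.sql import functions as F

# 1) Repartition both sides on the join key so matching keys are colocated.
df_a = df_a.repartition(200, "join_key")
df_b = df_b.repartition(200, "join_key")

# 2) If one side fits in executor memory, broadcast it to avoid shuffling
#    the large side at all.
diff = df_a.join(F.broadcast(df_b), "join_key", "leftanti")

# 3) For skewed keys, salt one side randomly and replicate the other side
#    across every salt value, so a hot key spreads over many partitions.
num_salts = 16
salts = F.array(*[F.lit(i) for i in range(num_salts)])
df_a_salted = df_a.withColumn("salt", (F.rand() * num_salts).cast("int"))
df_b_salted = df_b.withColumn("salt", F.explode(salts))
diff_salted = df_a_salted.join(df_b_salted, ["join_key", "salt"], "leftanti")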

Feel free to ask if you have more questions or need further assistance!

etao
New Contributor III

Thanks for the information. Unfortunately, repartitioning did not work: the data was pulled into Databricks as a single partition, repartitioning it also took forever, and that work was never distributed. Instead, I partitioned the data while bringing it in, using the JDBC partition options, and that approach works.
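
For reference, a parallel JDBC read along the lines described above might look like this. The URL, credentials, table name, and bounds are placeholders; partitionColumn must be a numeric, date, or timestamp column, and lowerBound/upperBound should span its actual value range:

df_a = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
    .option("dbtable", "source_table_a")                  # placeholder table
    .option("user", "username")                           # placeholder
    .option("password", "password")                       # placeholder
    .option("partitionColumn", "join_key")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "40")  # 40 range-bounded queries run in parallel
    .load()
)

With this, Spark splits the read into numPartitions range queries, so the data arrives already partitioned and subsequent joins can be distributed.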
