Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to distribute PySpark DataFrame repartition and row count on Databricks?

etao
New Contributor III

I am trying to compare two large datasets for discrepancies. The datasets come from two database tables, each with around 500 million rows. I use PySpark subtract and joins (leftanti, leftsemi) to sort out the differences (sketched below). To distribute the workload, I need to repartition the two datasets on the join key column. The repartition takes forever and eventually errors out, and the cluster dashboard shows only one executor working even though up to 5 workers are allocated.

Py4JJavaError: An error occurred while calling o8067.javaToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 85.0 failed 4 times, most recent failure: Lost task 0.3 in stage 85.0 (TID 437) (10.201.112.98 executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 133932 ms Driver stacktrace:

Questions:

- How do I distribute a workload such as repartitioning and counting DataFrames?

- Is there a better solution for achieving this goal?
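
For reference, a minimal sketch of the comparison approach described in the question, using leftanti joins on a key column. The table names source_table_a and source_table_b and the column join_key are hypothetical stand-ins:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source tables, each with ~500M rows.
df_a = spark.table("source_table_a")
df_b = spark.table("source_table_b")

# Rows whose join key appears in one table but not the other.
only_in_a = df_a.join(df_b, on="join_key", how="leftanti")
only_in_b = df_b.join(df_a, on="join_key", how="leftanti")

print(only_in_a.count(), only_in_b.count())

If both counts are zero the key sets match; comparing non-key columns would additionally require a subtract or a full-row join.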


2 REPLIES

Kaniz_Fatma
Community Manager

Hi @etao, to distribute the workload effectively, try repartitioning by the join key column or increasing the number of partitions. Use coalesce only when reducing the number of partitions, since it avoids shuffling data. For better performance, consider a broadcast join when one dataset is small, or handle data skew by adding a salt key to balance the partitions. Also, optimize your cluster configuration by adjusting the number of executors and the memory per executor. A rough sketch of these options follows below.
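
A rough sketch of these suggestions, reusing the df_a, df_b, and join_key placeholders from the question; the partition count and salt count are arbitrary illustrative values:

from pyspark.sql import functions as F

# 1) Repartition both sides on the join key so matching keys are colocated.
df_a = df_a.repartition(200, "join_key")
df_b = df_b.repartition(200, "join_key")

# 2) If one side fits in executor memory, broadcast it to avoid shuffling
#    the large side at all.
diff = df_a.join(F.broadcast(df_b), "join_key", "leftanti")

# 3) For skewed keys, salt one side randomly and replicate the other side
#    across every salt value, so a hot key spreads over many partitions.
num_salts = 16
salts = F.array(*[F.lit(i) for i in range(num_salts)])
df_a_salted = df_a.withColumn("salt", (F.rand() * num_salts).cast("int"))
df_b_salted = df_b.withColumn("salt", F.explode(salts))
diff_salted = df_a_salted.join(df_b_salted, ["join_key", "salt"], "leftanti")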

Feel free to ask if you have more questions or need further assistance!

etao
New Contributor III

Thanks for the information. Unfortunately, repartitioning did not work: the data was pulled into Databricks as a single partition, repartitioning it also took forever, and that work was never distributed. Instead, I partitioned the data while bringing it in, using the JDBC partition options, and that approach works.
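
For reference, a parallel JDBC read along the lines described above might look like this. The URL, credentials, table name, and bounds are placeholders; partitionColumn must be a numeric, date, or timestamp column, and lowerBound/upperBound should span its actual value range:

df_a = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
    .option("dbtable", "source_table_a")                  # placeholder table
    .option("user", "username")                           # placeholder
    .option("password", "password")                       # placeholder
    .option("partitionColumn", "join_key")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "40")  # 40 range-bounded queries run in parallel
    .load()
)

With this, Spark splits the read into numPartitions range queries, so the data arrives already partitioned and subsequent joins can be distributed.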
