08-02-2024 05:39 AM
I am trying to compare two large datasets for discrepancies. The datasets come from two database tables, each with around 500 million rows. I use PySpark subtract and joins (leftanti, leftsemi) to sort out the differences. To distribute the workload, I need to repartition the two datasets on the join key column. The repartition takes forever and eventually errors out. The cluster dashboard shows only one executor working, although up to 5 workers are allocated.
Py4JJavaError: An error occurred while calling o8067.javaToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 85.0 failed 4 times, most recent failure: Lost task 0.3 in stage 85.0 (TID 437) (10.201.112.98 executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 133932 ms Driver stacktrace:
Questions:
- How can I distribute a workload such as repartitioning and counting these dataframes?
- Is there a better way to achieve the goal?
Accepted Solutions
08-19-2024 04:51 AM
Thanks for the information. Unfortunately, repartition did not work. Once the data had been pulled in as a single partition, repartitioning it also took forever, and the repartition work did not get distributed. Instead, I tried partitioning the data while bringing it in, using the JDBC partition options. That approach works.
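For anyone landing here later, this is roughly what reading with the JDBC partition options can look like. It is only a sketch, not the exact code I ran: the helper names (`jdbc_partition_options`, `read_partitioned`, `rows_missing_from`) are made up for illustration, and the URL, table name, key column, and bounds in the usage note are placeholders. It assumes the join key is a numeric column, since Spark's `partitionColumn` must be numeric, date, or timestamp.

```python
from typing import Dict, List


def jdbc_partition_options(url: str, table: str, partition_column: str,
                           lower_bound: int, upper_bound: int,
                           num_partitions: int,
                           fetch_size: int = 10000) -> Dict[str, str]:
    """Build Spark JDBC reader options that split one table read into
    num_partitions parallel range queries on partition_column."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return {
        "url": url,
        "dbtable": table,
        # The four options below must be supplied together.
        "partitionColumn": partition_column,
        # The bounds only set the partition stride; they do not filter rows.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; the default here is an arbitrary choice.
        "fetchsize": str(fetch_size),
    }


def read_partitioned(spark, options: Dict[str, str]):
    """Load the table through an existing SparkSession.
    Credentials (user, password) would be added to options by the caller."""
    return spark.read.format("jdbc").options(**options).load()


def rows_missing_from(left, right, key_columns: List[str]):
    """Rows of left whose key has no match in right (left anti join)."""
    return left.join(right, on=key_columns, how="left_anti")
```

With two dataframes loaded this way, for example `a = read_partitioned(spark, jdbc_partition_options(url_a, "schema.table_a", "id", 1, 500000000, 200))` and the same for `b`, the read is already spread across executors, so `rows_missing_from(a, b, ["id"])` and `rows_missing_from(b, a, ["id"])` no longer start from a single partition. The partition count of 200 is a guess; it should be tuned to the cluster size and to how many concurrent connections the source database can handle.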

