I am trying to compare two large datasets for discrepancies. The datasets come from two database tables, each with around 500 million rows. I use PySpark's subtract and joins (left_anti, left_semi) to sort out the differences. To distribute the workload, I repartition the two datasets on the join key column. The repartition takes forever and eventually errors out, and the cluster dashboard shows only one executor working, even though up to 5 workers are allocated.
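For reference, here is a simplified sketch of what I'm running (the table names and the join key `id` are placeholders for the real schema, and the partition count is just a starting guess):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-diff").getOrCreate()

df_a = spark.table("table_a")  # ~500M rows
df_b = spark.table("table_b")  # ~500M rows

# Repartition both sides on the join key so matching rows land in the
# same partitions; 2000 is an arbitrary starting point for this volume.
df_a = df_a.repartition(2000, "id")
df_b = df_b.repartition(2000, "id")

# Rows present in one table but missing from the other.
only_in_a = df_a.join(df_b, on="id", how="left_anti")
only_in_b = df_b.join(df_a, on="id", how="left_anti")

print(only_in_a.count(), only_in_b.count())
```

The job fails with: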
```
Py4JJavaError: An error occurred while calling o8067.javaToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 85.0 failed 4 times, most recent failure: Lost task 0.3
in stage 85.0 (TID 437) (10.201.112.98 executor 8): ExecutorLostFailure
(executor 8 exited caused by one of the running tasks)
Reason: Executor heartbeat timed out after 133932 ms
Driver stacktrace:
```
Questions:
- How can I distribute a workload like this (repartitioning and counting large DataFrames) across all of the executors instead of just one?
- Is there a better approach to finding the discrepancies between two tables of this size?