Hi, I am using Databricks to load data from one Delta table into another Delta table.
I'm using the Simba Spark JDBC connector to pull data from a Delta table in my source instance and write it into a Delta table in my Databricks instance.
The source table has ~160M rows and 300 columns.
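For context, the read side looks roughly like the sketch below (sketch only; the host, HTTP path, token, driver class, and source table name are placeholders, and the exact JDBC URL format depends on the Simba driver version):

jdbc_url = ("jdbc:spark://<source-host>:443/default;transportMode=http;ssl=1;"
            "httpPath=<http-path>;AuthMech=3;UID=token;PWD=<access-token>")

df = (spark.read.format("jdbc")
      .option("driver", "com.simba.spark.jdbc.Driver")   # assumed class name; varies by Simba driver version
      .option("url", jdbc_url)
      .option("dbtable", "source_db.sample_file")        # hypothetical source table name
      .option("fetchsize", 10000)                        # rows fetched per round trip
      .load())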
While writing into the Delta table in my Databricks instance, I'm getting the following error:
An error occurred while calling o494.save. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 6, 10.82.228.157, executor 8): java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (4.1 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
I've also attached the detailed error log (errorlog.txt).
Here is my code snippet for writing into the Delta table:
file_location = '/dbfs/perf_test/sample_file'
options = { "table_name": 'sample_file', "overwriteSchema": True, "mergeSchema": True }
df.repartition(8).write.format('delta').mode('overwrite').options(**options).save(file_location)
My Databricks cluster config is:
r4.2xlarge, 61 GB memory, 8 cores per node
10 nodes (scales up to 16 nodes)
Here is my spark config:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2047m
spark.scheduler.mode FAIR
spark.executor.cores 8
spark.executor.memory 42g
spark.driver.maxResultSize 0 (tried both 0 and 50g)
spark.driver.memory 42g
spark.driver.cores 8
I also tried setting spark.driver.maxResultSize to 0 and to 50g, but neither helped.
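For reference, here is a minimal way to check which values the driver actually picked up (run from a notebook on the cluster; spark is the session Databricks provides):

conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.maxResultSize", "not set"))   # should reflect the cluster Spark config
print(conf.get("spark.driver.memory", "not set"))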