Write 160M rows with 300 columns into Delta Table using Databricks?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-28-2021 07:49 AM
Hi, I am using databricks to load data from one delta table into another delta table.
I'm using SIMBA Spark JDBC connector to pull data from delta table in my source instance and writing into delta table in my databricks instance.
The source has ~160M Rows and 300 columns of data.
While writing into delta table in my databricks instance, I'm getting following error:
An error occurred while calling o494.save. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 6, 10.82.228.157, executor 8): java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (4.1 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
Also attached the detailed error log here errorlog.txt.
Here is my code snippet for writing into delta table:
file_location = '/dbfs/perf_test/sample_file'
options = { "table_name": 'sample_file', "overwriteSchema": True, "mergeSchema": True }
df.repartition(8).write.format('delta').mode('overwrite').options(**options).save(file_location)My databricks instance config is:
r4.2xlarge, 61 GB Memory, 8 Cores 10 nodes (Scales up to 16nodes)
Here is my spark config:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2047m
spark.scheduler.mode FAIR
spark.executor.cores 8
spark.executor.memory 42g
spark.driver.maxResultSize 0 (tried with 0 or 50g)
spark.driver.memory 42g
spark.driver.cores 8
Also I tried with setting up spark.driver.maxResultSize value to 0 and 50g which is not helping me.
- Labels:
-
Dataframe
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-07-2022 03:17 PM
Hi @govind@dqlabs.ai ,
Have you tried to remove "repartition(8)"? is there a reason why you only want to have 8 partitions?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-02-2022 08:41 AM
Hi @govind@dqlabs.ai
Just wanted to check in if you were able to resolve your issue or do you need more help? We'd love to hear from you.
Thanks!