Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Write 160M rows with 300 columns into Delta Table using Databricks?

govind
New Contributor

Hi, I am using Databricks to load data from one Delta table into another Delta table.

I'm using the Simba Spark JDBC connector to pull data from a Delta table in my source instance and write it into a Delta table in my Databricks instance.

The source has ~160M rows and 300 columns of data.
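
For context, the read side of a pipeline like this typically looks something like the sketch below. This is an assumption, not code from the post: the host, httpPath, token, driver class name, and partitioning column are all placeholders, and it presumes the Simba JDBC driver jar is installed on the cluster.

jdbc_url = (
    "jdbc:spark://<source-host>:443/default;"
    "transportMode=http;ssl=1;AuthMech=3;httpPath=<http-path>"
)

df = (
    spark.read.format("jdbc")  # `spark` is the notebook-provided SparkSession
    .option("driver", "com.simba.spark.jdbc.Driver")  # assumed Simba driver class
    .option("url", jdbc_url)
    .option("dbtable", "sample_file")
    .option("user", "token")
    .option("password", "<personal-access-token>")
    # Without partitioning options, a JDBC read runs as a single task and
    # funnels all ~160M rows through it; the bounds below are illustrative.
    .option("partitionColumn", "id")  # assumed numeric key column
    .option("lowerBound", "0")
    .option("upperBound", "160000000")
    .option("numPartitions", "64")
    .load()
)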

While writing into the Delta table in my Databricks instance, I get the following error:

An error occurred while calling o494.save. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 6, 10.82.228.157, executor 8): java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (4.1 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

I've also attached the detailed error log: errorlog.txt.

Here is my code snippet for writing into the Delta table:

file_location = '/dbfs/perf_test/sample_file'
options = {'table_name': 'sample_file', 'overwriteSchema': True, 'mergeSchema': True}
df.repartition(8).write.format('delta').mode('overwrite').options(**options).save(file_location)

My databricks instance config is:

r4.2xlarge (61 GB memory, 8 cores per node), 10 nodes (scales up to 16 nodes)

Here is my spark config:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2047m
spark.scheduler.mode FAIR
spark.executor.cores 8
spark.executor.memory 42g
spark.driver.maxResultSize 0 (tried 0 and 50g)
spark.driver.memory 42g
spark.driver.cores 8

I also tried setting spark.driver.maxResultSize to both 0 and 50g, but neither helped.

2 REPLIES

jose_gonzalez
Databricks Employee

Hi @govind@dqlabs.ai,

Have you tried removing "repartition(8)"? Is there a reason you want only 8 partitions?
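
For reference, that suggestion applied to the snippet above would look like this (a sketch, not code from the reply):

file_location = '/dbfs/perf_test/sample_file'
options = {'table_name': 'sample_file', 'overwriteSchema': True, 'mergeSchema': True}

# Same write as in the question, minus the explicit repartition(8): the write
# then inherits df's existing partitioning, so no single task has to carry
# an eighth of ~160M rows x 300 columns.
df.write.format('delta').mode('overwrite').options(**options).save(file_location)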

Anonymous
Not applicable

Hi @govind@dqlabs.ai,

Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you.

Thanks!
