Write 160M rows with 300 columns into Delta Table using Databricks?

govind
New Contributor

Hi, I am using Databricks to load data from one Delta table into another.

I'm using the Simba Spark JDBC connector to pull data from a Delta table in my source instance and write it into a Delta table in my Databricks instance.

The source has ~160M rows and 300 columns of data.
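
For context, the read side looks roughly like the sketch below (the host, HTTP path, token, table name, and the numeric partition column are placeholders, and the partitioning options are just one common way to parallelize a JDBC pull of this size):

jdbc_url = ("jdbc:spark://<source-host>:443/default;"
            "transportMode=http;ssl=1;httpPath=<http-path>;"
            "AuthMech=3;UID=token;PWD=<access-token>")

df = (spark.read.format("jdbc")
      .option("driver", "com.simba.spark.jdbc.Driver")  # legacy Simba Spark driver class
      .option("url", jdbc_url)
      .option("dbtable", "source_db.sample_file")       # placeholder table name
      .option("partitionColumn", "id")                  # assumes a numeric key column
      .option("lowerBound", "0")
      .option("upperBound", "160000000")
      .option("numPartitions", "64")                    # splits the pull into parallel queries
      .load())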

While writing into the Delta table in my Databricks instance, I'm getting the following error:

An error occurred while calling o494.save. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 6, 10.82.228.157, executor 8): java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (4.1 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

I've also attached the detailed error log as errorlog.txt.

Here is my code snippet for writing into the Delta table:

file_location = '/dbfs/perf_test/sample_file'
options = {
    "table_name": 'sample_file',
    "overwriteSchema": True,
    "mergeSchema": True,
}
df.repartition(8).write.format('delta').mode('overwrite').options(**options).save(file_location)

My Databricks instance config is:

r4.2xlarge: 61 GB memory, 8 cores, 10 nodes (scales up to 16 nodes)

Here is my spark config:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2047m
spark.scheduler.mode FAIR
spark.executor.cores 8
spark.executor.memory 42g
spark.driver.maxResultSize 0 (tried with 0 or 50g)
spark.driver.memory 42g
spark.driver.cores 8

I also tried setting spark.driver.maxResultSize to 0 and to 50g, but neither helped.

4 REPLIES

Kaniz_Fatma
Community Manager

Hi @govind@dqlabs.ai! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; if not, I'll get back to you soon. Thanks.

Kaniz_Fatma
Community Manager

Hi @govind@dqlabs.ai, there seems to be a mismatch between the Spark connector and the Spark version being used. Could you please specify the versions of your Spark and the connector?
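
For reference, a quick way to capture the Spark side is to run this in a notebook on each cluster (the connector version is usually visible in the Simba driver jar's file name):

# Prints the Spark version of the cluster the notebook is attached to
print(spark.version)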

jose_gonzalez
Moderator

Hi @govind@dqlabs.ai,

Have you tried removing repartition(8)? Is there a reason why you only want 8 partitions?
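
For example, something along these lines (the 400 is purely illustrative; the point is to avoid squeezing 160M rows x 300 columns into just 8 write tasks):

# Let the write use the DataFrame's existing partitioning:
df.write.format('delta').mode('overwrite').options(**options).save(file_location)

# Or, if you want to control the output file count, size the partition
# count to the data volume instead of fixing it at 8:
df.repartition(400).write.format('delta').mode('overwrite').options(**options).save(file_location)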

Anonymous
Not applicable

Hi @govind@dqlabs.ai,

Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you.

Thanks!
