<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Write 160M rows with 300 columns into Delta Table using Databricks? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17200#M11234</link>
    <description>&lt;P&gt;Hi @govind@dqlabs.ai​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just wanted to check in to see whether you were able to resolve your issue, or if you need more help. We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 02 May 2022 15:41:34 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-05-02T15:41:34Z</dc:date>
    <item>
      <title>Write 160M rows with 300 columns into Delta Table using Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17196#M11230</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt; Hi, I am using Databricks to load data from one Delta table into another Delta table.&lt;/P&gt;
&lt;P&gt; I'm using the Simba Spark JDBC connector to pull data from a Delta table in my source instance and write it into a Delta table in my Databricks instance. &lt;/P&gt;
&lt;P&gt; The source has ~160M rows and 300 columns of data. &lt;/P&gt;
&lt;P&gt; While writing into the Delta table in my Databricks instance, I'm getting the following error:&lt;/P&gt;
&lt;P&gt; An error occurred while calling o494.save. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 6, 10.82.228.157, executor 8): java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (4.1 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.&lt;/P&gt;
&lt;P&gt;Also attached the detailed error log here &lt;A href="https://storage/attachments/4393-errorlog.txt" target="_blank"&gt;errorlog.txt&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt; &lt;B&gt;Here is my code snippet for writing into delta table:&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;file_location = '/dbfs/perf_test/sample_file' 
options = { "table_name": 'sample_file', "overwriteSchema": True, "mergeSchema": True } 
df.repartition(8).write.format('delta').mode('overwrite').options(**options).save(file_location)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;My Databricks instance config is:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;r4.2xlarge, 61 GB memory, 8 cores
10 nodes (scales up to 16 nodes)&lt;/P&gt;&lt;P&gt;&lt;B&gt;Here is my Spark config:&lt;/B&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2047m
spark.scheduler.mode FAIR
spark.executor.cores 8
spark.executor.memory 42g
spark.driver.maxResultSize 0 (tried with 0 or 50g)
spark.driver.memory 42g
spark.driver.cores 8
&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I also tried setting spark.driver.maxResultSize to 0 and to 50g, but neither helped.&lt;/P&gt;
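&lt;P&gt;One direction I'm sketching (unverified; it assumes the source table has a numeric key column, here called &lt;CODE&gt;id&lt;/CODE&gt;, and uses placeholder connection values) is to partition the JDBC read itself, so each executor pulls its own bounded range of rows rather than the whole result set being serialized back through a single query:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# Sketch only: partitioned JDBC read with placeholder URL and table name.
# Spark issues numPartitions parallel range-bounded queries instead of one big one,
# so no single task has to hold (or return) the full result.
jdbc_url = "SOURCE_JDBC_URL"  # placeholder for the Simba Spark JDBC connection URL

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "source_table")   # placeholder source table name
      .option("partitionColumn", "id")     # assumed numeric key column
      .option("lowerBound", "1")
      .option("upperBound", "160000000")   # ~160M rows
      .option("numPartitions", "64")       # ~2.5M rows per partition
      .option("fetchsize", "10000")        # rows fetched per round trip
      .load())

df.write.format("delta").mode("overwrite").save(file_location)
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Dropping the final repartition(8) would also let the write parallelism follow the read partitions instead of squeezing everything into 8 tasks.&lt;/P&gt;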
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jul 2021 14:49:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17196#M11230</guid>
      <dc:creator>govind</dc:creator>
      <dc:date>2021-07-28T14:49:40Z</dc:date>
    </item>
    <item>
      <title>Re: Write 160M rows with 300 columns into Delta Table using Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17199#M11233</link>
      <description>&lt;P&gt;Hi @govind@dqlabs.ai​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Have you tried removing "repartition(8)"? Is there a reason why you only want to have 8 partitions?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 23:17:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17199#M11233</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-03-07T23:17:36Z</dc:date>
    </item>
    <item>
      <title>Re: Write 160M rows with 300 columns into Delta Table using Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17200#M11234</link>
      <description>&lt;P&gt;Hi @govind@dqlabs.ai​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just wanted to check in to see whether you were able to resolve your issue, or if you need more help. We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 15:41:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/write-160m-rows-with-300-columns-into-delta-table-using/m-p/17200#M11234</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-05-02T15:41:34Z</dc:date>
    </item>
  </channel>
</rss>

