Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to write trillions of rows to a Unity Catalog table

Nagarathna
New Contributor II

Hi team,

I have a DataFrame with 1,269,408,570,800 rows. I need to write this data to a Unity Catalog table.

How can I upload such a huge quantity of data?

I'm using Databricks Runtime 15.4 LTS with 4 workers; each worker is an i3.4xlarge and the driver is also an i3.4xlarge. I've tried to upload the data in chunks, but after 2 hours it fails with an error.

Can anyone guide me on how to upload such a large amount of data?

3 REPLIES

Lucas_TBrabo
Databricks Employee

Hi @Nagarathna, a couple of questions here...

  1. What error is the job returning after 2 hours of execution?
  2. What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of CSV/JSON/Excel files?
  3. How are you chunking the data? (if you could give a code example...)
  4. Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?
  5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?
  6. Is the cluster fixed at 4 workers, or is auto-scaling enabled?

Thanks for your reply.

Below are my answers to all the queries.

1. What error is the job returning after 2 hours of execution?
-> Below is a screenshot of the error.

[Screenshot: Nagarathna_0-1747288248186.png]

2. What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of CSV/JSON/Excel files?
-> Currently we are using PySpark and Python code to generate the data, so the source is a DataFrame.

3. How are you chunking the data? (if you could give a code example...)

-> The DataFrame df contains the data, and table_path is the name of the target Unity Catalog table.

total_rows = 1269408570800
batch_size = 1000000
num_batches = (total_rows // batch_size) + (1 if total_rows % batch_size != 0 else 0)
print(num_batches)
 
for i in range(num_batches):
    batch_df = df.filter(f"spark_partition_id() == {i}")
    batch_df.write.mode("overwrite").saveAsTable(table_path)

I've also tried using SQL to upload. Data is loaded into the table if I provide a limit of 100,000 rows, but if I increase the limit it fails with the same executor error. It will take too long to upload if we only load 100,000 rows at a time.

INSERT INTO dspf_test.bulksaleshourlydataperformance_1845.t_hourly_sales_summary_result SELECT * FROM temp_view LIMIT 100000000

4. Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?

-> I could see that CPU utilisation for all 4 executors had reached 80%.

5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?

-> Yes. ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 146027 ms

6. Is the cluster fixed at 4 workers, or is auto-scaling enabled?

-> Auto-scaling is not enabled.

Isi
Contributor III

Hey @Nagarathna @Lucas_TBrabo, I'd like to share my opinion and some tips that might help:

1. You should avoid filtering by spark_partition_id(), because it can create skewed partitions. Use repartition() instead, so Spark can optimize the execution plan (see the sketch after this list).

2. You are using overwrite mode. That means you're deleting the entire table and rewriting it on every batch. This is not only inefficient, it also negates the benefit of chunking and will keep triggering large-scale execution plans repeatedly.

If your goal is to progressively load the entire dataset into a Unity Catalog table, you should use append.

3. Your current setup:

  • 4 × i3.4xlarge = ~64 cores total, but low parallelism

    Suggestion:

    • Enable autoscaling, e.g. min=8, max=32, using smaller instances like i3.large or i3.xlarge for higher parallelism (a hypothetical cluster spec is sketched below)

    • This improves resiliency and reduces the chance of single-point executor crashes
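
For points 1 and 2, a minimal sketch of what that could look like, assuming df is your generated DataFrame; the target table name and the partition count of 256 are just placeholders you would tune for your data size and cluster:

# Placeholder target table name; replace with your actual Unity Catalog table
target_table = "catalog.schema.t_hourly_sales_summary_result"

# Repartition instead of filtering on spark_partition_id(); a rough rule of
# thumb is 2-4x the total cores (256 here is only an example for ~64 cores)
df_repartitioned = df.repartition(256)

# A single append write lets Spark parallelise across all partitions and
# avoids rewriting the whole table on every batch
df_repartitioned.write.mode("append").saveAsTable(target_table)

If you still want to load in chunks, the same append write can be applied per chunk without wiping out what was already written.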
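
For point 3, a hypothetical cluster spec along those lines; the field names follow the Databricks Clusters API create-cluster request, and the cluster name and node types are only examples:

# Example cluster definition: more, smaller workers with autoscaling enabled
cluster_spec = {
    "cluster_name": "bulk-load-cluster",            # example name
    "spark_version": "15.4.x-scala2.12",            # Databricks Runtime 15.4 LTS
    "node_type_id": "i3.xlarge",                    # smaller workers, more of them
    "driver_node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 8, "max_workers": 32},
}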

I haven't tried this myself, but these are some ideas that came to mind. If you give it a shot, feel free to share the results so others can benefit too.

Hope this helps 🙂

Isi
