05-14-2025 03:21 AM
Hi team,
I have a dataframe with 1269408570800 rows. I need to write this data to a Unity Catalog table.
How can I upload such a huge quantity of data?
I'm using Databricks Runtime 15.4 LTS with 4 workers, each worker of type i3.4xlarge, and a driver of type i3.4xlarge. I've tried to upload this data using a chunk size, but after 2 hours it fails with an error.
Can anyone guide me on how I can upload such a huge amount of data?
05-14-2025 05:20 AM
Hi @Nagarathna, a couple of questions here...
05-14-2025 10:53 PM
Thanks for your reply.
Below are my answers to all the queries.
1. What error is the job returning after 2 hours of execution?
-> Below is a screenshot of the error.
2. What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of csv/json/excel files?
-> Currently we are using PySpark and Python code to create the data, so the source is a dataframe.
3. How are you chunking the data? (if you could give a code example...)
-> The df dataframe contains the data.
total_rows = 1269408570800
batch_size = 1000000
# number of batches, rounded up
num_batches = (total_rows // batch_size) + (1 if total_rows % batch_size != 0 else 0)
print(num_batches)

for i in range(num_batches):
    # take the rows that live in Spark partition i and write them to the table
    batch_df = df.filter(f"spark_partition_id() == {i}")
    batch_df.write.mode("overwrite").saveAsTable(table_path)
I've also tried using SQL to load the data. It loads into the table if I provide a limit of 100,000 rows, but if I increase the limit it fails with the same executor error, and loading only 100,000 rows at a time will take too long.
INSERT INTO dspf_test.bulksaleshourlydataperformance_1845.t_hourly_sales_summary_result SELECT * FROM temp_view LIMIT 100000000;
4. Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?
-> I could see that the CPU utilisation for all 4 executors has reached 80%.
5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?
-> Yes: ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 146027 ms
6. Is the cluster fixed at 4 workers, or is auto-scaling enabled?
-> Autoscaling is not enabled.
a month ago - last edited a month ago
Hey @Nagarathna @Lucas_TBrabo, I'd like to share my opinion and some tips that might help:
1. You should try to avoid filtering by spark_partition_id(), because it can create skewed partitions; use repartition() instead so Spark can optimize the execution plan (see the sketch after these tips).
2. You are using overwrite mode. That means you're deleting the entire table and rewriting it on every batch. This is not only inefficient, it also negates the benefit of chunking and keeps triggering large-scale execution plans repeatedly.
If your goal is to progressively load the entire dataset into a Unity Catalog table, you should use append mode.
3. Your current setup:
4 × i3.4xlarge = ~64 cores total, but low parallelism
Suggestion:
Enable autoscaling, e.g. min=8, max=32 using smaller instances like i3.large or i3.xlarge for higher parallelism
This improves resiliency and reduces the chance of single-point executor crashes
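For tips 1 and 2, here's a minimal sketch of what I mean, assuming df is your already-built dataframe and catalog.schema.t_hourly_sales_summary_result stands in for your target table (both names and the partition count are placeholders you'd tune to your setup):

num_partitions = 2048  # assumption: adjust to your data volume and total cores

(
    df.repartition(num_partitions)  # let Spark spread rows evenly instead of filtering by spark_partition_id()
      .write
      .mode("append")               # append into the Unity Catalog table instead of overwriting it each batch
      .saveAsTable("catalog.schema.t_hourly_sales_summary_result")
)

Written this way you may not need the manual batch loop at all, since Spark already parallelizes the write across tasks; if you do keep chunking, append each chunk rather than overwriting.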
I haven't tried this myself, but these are some ideas that came to mind. If you give it a shot, feel free to share the results so others can benefit too.
Hope this helps,
Isi