Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to write trillions of rows to a Unity Catalog table

Nagarathna
New Contributor II

Hi team,

I have a DataFrame with 1,269,408,570,800 rows. I need to write this data to a Unity Catalog table.

How can I upload such a huge quantity of data?

I'm using Databricks Runtime 15.4 LTS with 4 workers; each worker is an i3.4xlarge and the driver is also an i3.4xlarge. I've tried to upload the data in chunks, but after 2 hours it fails with an error.

Can anyone guide me on how to upload such a large amount of data?

3 REPLIES

Lucas_TBrabo
Databricks Employee

Hi @Nagarathna, a couple of questions here...

  1. What error is the job returning after 2 hours of execution?
  2. What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of CSV/JSON/Excel files?
  3. How are you chunking the data? (if you could give a code example...)
  4. Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?
  5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?
  6. Is the cluster fixed at 4 workers, or is auto-scaling enabled?

Thanks for your reply.

Below are my answers to all the queries.

1. What error is the job returning after 2 hours of execution?
-> Below is a screenshot of the error.

[Screenshot: Nagarathna_0-1747288248186.png]

2. What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of CSV/JSON/Excel files?
-> Currently we are using PySpark and Python code to generate the data, so the source is a DataFrame.

3. How are you chunking the data? (if you could give a code example...)

-> The DataFrame df contains the data, and table_path is the name of the target Unity Catalog table.

total_rows = 1269408570800
batch_size = 1000000
num_batches = (total_rows // batch_size) + (1 if total_rows % batch_size != 0 else 0)
print(num_batches)
 
for i in range(num_batches):
    batch_df = df.filter(f"spark_partition_id() == {i}")
    batch_df.write.mode("overwrite").saveAsTable(table_path)

I've also tried using SQL to upload. Data is loaded into the table if I provide a limit of 100,000 rows, but if I increase the limit it fails with the same executor error. It will take too long to upload if we only load 100,000 rows at a time.

INSERT INTO dspf_test.bulksaleshourlydataperformance_1845.t_hourly_sales_summary_result SELECT * FROM temp_view LIMIT 100000000

4. Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?

-> I could see that CPU utilisation for all 4 executors had reached 80%.

5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?

-> Yes. ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 146027 ms

6. Is the cluster fixed at 4 workers, or is auto-scaling enabled?

-> Auto-scaling is not enabled.

Isi
Contributor III

Hey @Nagarathna @Lucas_TBrabo, I'd like to share my opinion and some tips that might help:

1. You should avoid filtering by spark_partition_id(), because it can create skewed partitions. Use repartition() instead, so Spark can optimize the execution plan (see the sketch after this list).

2. You are using overwrite mode. That means you're deleting the entire table and rewriting it on every batch. This is not only inefficient, it also negates the benefit of chunking and will keep triggering large-scale execution plans repeatedly.

If your goal is to progressively load the entire dataset into a Unity Catalog table, you should use append.

3. Your current setup:

  • 4 × i3.4xlarge = ~64 cores total, but low parallelism

    Suggestion:

    • Enable autoscaling, e.g. min=8, max=32, using smaller instances like i3.large or i3.xlarge for higher parallelism (a hypothetical cluster spec is sketched below)

    • This improves resiliency and reduces the chance of single-point executor crashes
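
For points 1 and 2, a minimal sketch of what that could look like, assuming df is your generated DataFrame; the target table name and the partition count of 256 are just placeholders you would tune for your data size and cluster:

# Placeholder target table name; replace with your actual Unity Catalog table
target_table = "catalog.schema.t_hourly_sales_summary_result"

# Repartition instead of filtering on spark_partition_id(); a rough rule of
# thumb is 2-4x the total cores (256 here is only an example for ~64 cores)
df_repartitioned = df.repartition(256)

# A single append write lets Spark parallelise across all partitions and
# avoids rewriting the whole table on every batch
df_repartitioned.write.mode("append").saveAsTable(target_table)

If you still want to load in chunks, the same append write can be applied per chunk without wiping out what was already written.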
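
For point 3, a hypothetical cluster spec along those lines; the field names follow the Databricks Clusters API create-cluster request, and the cluster name and node types are only examples:

# Example cluster definition: more, smaller workers with autoscaling enabled
cluster_spec = {
    "cluster_name": "bulk-load-cluster",            # example name
    "spark_version": "15.4.x-scala2.12",            # Databricks Runtime 15.4 LTS
    "node_type_id": "i3.xlarge",                    # smaller workers, more of them
    "driver_node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 8, "max_workers": 32},
}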

I haven't tried this myself, but these are some ideas that came to mind. If you give it a shot, feel free to share the results so others can benefit too.

Hope this helps 🙂

Isi
