Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Write operation to the Delta table is not completing

borori
New Contributor II

Using a cluster in serverless mode, I join three tables and write the resulting DataFrame as follows:

df.write.mode('append').saveAsTable('table name')

The schema is as follows:

  • date string (ymd format)
  • id bigint
  • value string
  • partitioned by date
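
Put together, a minimal sketch of this setup looks roughly like the following (the source table names, the join key, and the target table name are placeholders for illustration, not the actual ones):

# Placeholder source tables and an assumed join key "id"
t1 = spark.table("table_a")
t2 = spark.table("table_b")
t3 = spark.table("table_c")

df = (
    t1.join(t2, "id")
      .join(t3, "id")
      .select("date", "id", "value")
)

# Append to the target table, which is already partitioned by date
df.write.mode("append").saveAsTable("table_name")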

 

When run as a job, after about one minute of execution the query profiler stops progressing and no response comes back.

When I canceled the job, the profiler's progress was updated, and the execution tree looked as if it had stopped at the write to the Delta table.

When I reduced the amount of data, the write succeeded, but when I rerun the job the same problem occurs. Recreating the table and rerunning gives the same result.

Also, if I comment out the write step and display the result instead, the result comes back in about one minute.
The select seems to work fine; only the write causes this problem.
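
In rough code, that check looked like this (display is the Databricks notebook helper; the commented line is the write step that was skipped):

# df.write.mode('append').saveAsTable('table name')   # write step commented out
display(df)                                           # comes back in about one minute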

What kind of analysis should I do?

1 ACCEPTED SOLUTION

Brahmareddy
Valued Contributor II

Hi @borori,

How are you doing today?

As I understand it, here are a few things worth checking:

  • Check the cluster's resource limits in serverless mode to make sure the job isn't hitting memory or I/O constraints.
  • Repartition the DataFrame on the date column before writing, to balance the load across partitions.
  • Examine the Delta transaction logs for any insight into what happens during the write.
  • Review your partitioning strategy: too many or too few partitions can hurt write performance.
  • Adjust the job's parallelism by tuning parameters such as spark.sql.shuffle.partitions.
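
A minimal sketch of the repartition and shuffle-tuning suggestions (the shuffle partition count and the table name below are assumptions, not values from this thread):

# Tune shuffle parallelism before the write; 200 is only a starting point
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Repartition on the table's partition column so the output files per date are balanced
df.repartition("date") \
  .write.mode("append") \
  .saveAsTable("table_name")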

Give it a try and let me know if it works.

Regards,

Brahma


2 REPLIES


borori
New Contributor II

Thank you for your advice. I couldn't reach a conclusion from the suggestions alone, but they gave me a reason to review all the logs again. The cause was that the data volume became too large because of a join on null values. The advice was very helpful because it let me reanalyze the cause. Thank you.
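
For anyone hitting the same issue, a rough way to spot this kind of blow-up before writing (the column and variable names here are assumptions) is to inspect the join keys and compare row counts:

# Look for null or heavily duplicated join keys that can inflate the joined result
t1.groupBy("id").count().orderBy("count", ascending=False).show(10)

# Compare the joined row count against the inputs before committing to the write
print(t1.count(), df.count())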
