09-11-2024 08:52 AM
Using a cluster in serverless mode, I join three tables and write the resulting DataFrame as follows (a rough sketch of the full step is included after the schema):
df.write.mode('append').saveAsTable('table name')
The schema is:
- date string (ymd format)
- id bigint
- value string
- partition by date
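The whole step looks roughly like this; the source table names and the join key are placeholders, not the actual names used in the job:

```python
# Rough sketch of the failing step; source table names and the join key
# are placeholders.
df = (
    spark.table('source_a')
    .join(spark.table('source_b'), 'id')
    .join(spark.table('source_c'), 'id')
    .select('date', 'id', 'value')
)

# Append to the existing Delta table, which is partitioned by `date`.
df.write.mode('append').saveAsTable('table_name')
```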
After about one minute of execution as a job, the query profile stops progressing and no response comes back.
When I cancel the job, the query profile is finally updated, and the plan tree looks as if execution stalled while writing to the Delta table.
When I reduced the amount of data it succeeded, but rerunning the job hits the same problem again; the same thing happens even after re-creating the table and re-running.
Also, if I comment out the write and instead display the result, it comes back in about one minute.
The select itself seems to work fine; only the write causes this problem.
What kind of analysis should I do?
Accepted Solutions
09-18-2024 07:30 AM
Hi @borori,
How are you doing today?
Based on my understanding, a few things are worth checking:
- Check the cluster's resource limits in serverless mode to make sure the job is not hitting memory or I/O constraints.
- Repartition the DataFrame on the date column before writing so the load is balanced across partitions.
- Examine the Delta transaction logs to see whether they give any insight into what happens during the write.
- Review your partitioning strategy; too many or too few partitions can hurt write performance.
- Adjust the job's parallelism, for example by tuning spark.sql.shuffle.partitions, to improve write throughput (see the sketch below).
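A minimal sketch of the repartition-before-write idea; the partition count and table name are placeholders, and note that serverless compute may restrict which Spark configurations you are allowed to change:

```python
# Sketch only: tune shuffle parallelism for the joins (serverless compute may
# not allow changing this configuration).
spark.conf.set('spark.sql.shuffle.partitions', '200')

# Repartition on the partition column before writing so each `date` partition
# is written by its own set of tasks.
(df.repartition('date')
   .write
   .mode('append')
   .saveAsTable('table_name'))
```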
Give it a try and let me know if it works.
Regards,
Brahma
10-07-2024 06:41 AM
Thank you for your advice. I couldn't reach a conclusion from the suggestions alone, but they prompted me to go back through all the logs. The cause was that the joined result became far too large because rows with null keys were joining against each other. The advice was very helpful because it pushed me to re-analyze the root cause. Thank you.
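For anyone hitting something similar, here is a rough way to spot this kind of join blow-up before writing; the DataFrame and column names below are placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical diagnostic: count how many rows share each join key, including
# null keys. A key that appears N times on one side and M times on the other
# produces N * M rows after the join.
(df_left
 .groupBy('id')
 .count()
 .orderBy(F.desc('count'))
 .show(20, truncate=False))

# If the null-key rows are not needed in the result, dropping them before the
# join keeps the intermediate data from exploding.
df_joined = (
    df_left.filter(F.col('id').isNotNull())
    .join(df_right.filter(F.col('id').isNotNull()), 'id')
)
```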

