03-30-2023 12:42 AM
Hi,
I have a Spark job that processes a large dataset, and it is taking too long. In the Spark UI, I can see it is running 1 task out of 9 tasks. I am not sure how to run this in parallel. I have already enabled autoscaling with up to 8 instances.
Attached is an image of the Spark UI.
Please suggest how to debug this and fix the performance issue.
03-30-2023 01:26 AM
From the screenshot you provided, it seems you are doing a merge statement.
Depending on the partitioning of your Delta table, this can be done in parallel or not.
For example, if all your incoming data lands in one huge partition, Spark will have to completely rewrite that huge partition, which can take a long time.
Can you share some code?
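In the meantime, here is a minimal sketch of what I mean, assuming a Delta table partitioned by a date column. The names target_table, event_date, id and source_df are placeholders, not your actual code. Including the partition column in the merge condition lets Delta prune to the affected partitions instead of rewriting the whole table:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")  # hypothetical target table name
(target.alias("t")
 .merge(
     source_df.alias("s"),
     # restricting on the partition column (event_date) enables partition pruning
     "t.id = s.id AND t.event_date = s.event_date")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())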
03-30-2023 06:47 AM
Hi @Sanjay Jain, did you get a chance to see how many partitions are available in your dataframe before performing the merge operation, and how the data is distributed between them? This will help you see whether you have any skewed data. You might also need to look at the key on which you are merging, to check for skew on any specific set of values.
The code below will give you the record count per partition:
from pyspark.sql.functions import spark_partition_id
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()
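To check skew on the merge key itself, something like the following shows the heaviest values; merge_key is a placeholder for whatever column your MERGE condition actually uses:

from pyspark.sql.functions import col

# top 20 merge-key values by row count; a few very large counts indicate key skew
(rawDf.groupBy("merge_key")
      .count()
      .orderBy(col("count").desc())
      .show(20))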
03-31-2023 03:07 AM
My partitioning is based on date. Here is the partition information for around 70k records:
+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|14557|
|          1|25455|
|          2|20330|
|          3| 1776|
|          4| 2868|
|          5| 1251|
|          6| 1145|
|          7|  127|
+-----------+-----+
03-31-2023 03:12 AM
That is pretty skewed; however, that does not explain why there is no parallelism.
The only reasons I can see are that either:
- the merge only hits one partition
- you apply a coalesce(1) or repartition(1) somewhere
03-31-2023 03:22 AM
There are 8 partitions, and this is the same data I need to merge.
How can I check how many partitions are used by the merge? It should use all 8 partitions.
No, I haven't used coalesce or repartition.
Is it possible to connect live? I can show you the code.
03-31-2023 03:29 AM
In the history of the Delta table you can see how many files have been rewritten (in the operationMetrics column).
There are statistics like numTargetFilesAdded and numTargetFilesRemoved, etc.
The fact that your source dataframe (the incoming data) has 8 partitions does not mean that the Delta table will also update 8 partitions.
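For example, assuming your target is registered as a table (your_target_table is a placeholder), you can pull those metrics out of the history like this:

from delta.tables import DeltaTable

# last few operations on the target table, including operationMetrics
hist = DeltaTable.forName(spark, "your_target_table").history(5)
hist.select("version", "operation", "operationMetrics").show(truncate=False)

# or the SQL equivalent
spark.sql("DESCRIBE HISTORY your_target_table") \
     .select("version", "operation", "operationMetrics") \
     .show(truncate=False)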
03-31-2023 03:46 AM
The Delta table uses the same columns as the source table and should have 8 partitions.
03-31-2023 03:50 AM
The number of partitions in the Delta table is not relevant; what is relevant is how many partitions or files are affected by the merge.
That can be displayed in the Delta history.
Databricks can also apply optimizations while writing, so it is possible that it decides to write a single file instead of 8. Writing will be worse, but reading will be faster.
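If you want to see whether optimized writes are in play, you can check the session config and the table properties. The setting names below (spark.databricks.delta.optimizeWrite.enabled and the delta.autoOptimize.optimizeWrite table property) are the usual Databricks knobs, but verify them against your runtime version before changing anything:

# session-level setting (may be unset, in which case the table property / runtime default applies)
print(spark.conf.get("spark.databricks.delta.optimizeWrite.enabled", "not set"))

# table-level properties, e.g. delta.autoOptimize.optimizeWrite (table name is a placeholder)
spark.sql("SHOW TBLPROPERTIES your_target_table").show(truncate=False)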
03-31-2023 04:01 AM
Any suggestions to improve the performance? Is there any parameter configuration to optimize this? Any documentation on how to debug this in the Spark UI?
03-31-2023 04:10 AM
There are several methods:
You can disable optimizations (see the Databricks Delta Lake performance optimization docs), but I would advise against that.
The Databricks default settings of the most recent runtimes are pretty well optimized IMO. You can write fast using 80 CPUs (so 80 partitions), but that will have a negative performance impact when reading this data.
Semantic partitioning of the Delta table is certainly a good idea (if not already done), and there is also Z-ORDER (see the sketch below).
There is no simple answer to this.
If your merge does end up running in parallel, you also have to take the data skew into account.
Debugging in Spark is really hard, if not almost impossible, due to the parallel nature of the application.
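As a concrete starting point (table and column names are placeholders, and whether Z-ORDER helps depends on your merge/query pattern), the compaction and Z-ORDER step looks like this, and it is worth confirming that adaptive query execution and its skew handling are enabled:

# compact small files and cluster the data on the merge key (merge_key is a placeholder)
spark.sql("OPTIMIZE your_target_table ZORDER BY (merge_key)")

# AQE and skew-join handling are on by default in recent runtimes, but it does no harm to confirm
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))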
03-31-2023 07:12 PM
Hi @Sanjay Jain,
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!
04-03-2023 12:02 AM
Hi Vidula,
I am not able to find the right solution to this problem. I would appreciate any help you can provide.
Regards,
Sanjay
07-24-2024 04:07 AM
Would it be any different, i.e. faster, if using Spark within Azure?