Spark tasks too slow and not doing parallel processing

sanjay — Thu, 30 Mar 2023 07:42:50 GMT

Hi,

I have a Spark job that processes a large data set, and it is taking too long. In the Spark UI, I can see it is running only 1 task out of 9 tasks. I am not sure how to make this run in parallel. I have already enabled autoscaling with up to 8 instances.

I have attached an image of the Spark UI.

Please suggest how to debug this and fix the performance issue.

Re: Spark tasks too slow and not doing parallel processing

-werners- — Thu, 30 Mar 2023 08:26:28 GMT

From the screenshot you provided, it seems you are running a merge statement.

Depending on how your Delta table is partitioned, this may or may not be done in parallel.

For example, if all your incoming data resides in one huge partition, Spark will have to rewrite that entire partition, which can take a long time.

Can you share some code?

Re: Spark tasks too slow and not doing parallel processing

pvignesh92 — Thu, 30 Mar 2023 13:47:00 GMT

Hi @Sanjay Jain, did you get a chance to check how many partitions your DataFrame has before the merge operation, and how the data is distributed between them? This will show you whether your data is skewed. You may also need to look at the key you are merging on, to check for skew on any specific set of values.

The code below will give you the record count per partition:

from pyspark.sql.functions import spark_partition_id

# Tag each row with the ID of the partition it lives in, then count rows per partition
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()
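
The merge-key check mentioned above could be sketched along the following lines. This is only an illustration, not a tested recipe for this job: `key_counts`, `dominant_keys`, the `"merge_key"` column name and the 20% threshold are all hypothetical choices, and `rawDf` is the DataFrame from the snippet above.

```python
def key_counts(df, key_col, top_n=20):
    """Return a DataFrame with the top_n most frequent key_col values and their row counts."""
    return (
        df.groupBy(key_col)
        .count()
        .orderBy("count", ascending=False)
        .limit(top_n)
    )


def dominant_keys(rows, total, min_share=0.2):
    """From (key, count) pairs, return the keys that hold at least min_share of total rows."""
    if total <= 0:
        return []
    return [key for key, count in rows if count / total >= min_share]


# Example wiring (commented out because it needs a live Spark session and your DataFrame):
# top = key_counts(rawDf, "merge_key").collect()        # "merge_key" is a placeholder
# rows = [(r["merge_key"], r["count"]) for r in top]
# print(dominant_keys(rows, total=rawDf.count()))
```

If a handful of key values account for most of the rows, the tasks that handle those values will run much longer than the rest.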

Re: Spark tasks too slow and not doing parallel processing

sanjay — Fri, 31 Mar 2023 10:07:04 GMT

My partitioning is based on date. Here is the partition information for around 70k records:

+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|14557|
|          1|25455|
|          2|20330|
|          3| 1776|
|          4| 2868|
|          5| 1251|
|          6| 1145|
|          7|  127|
+-----------+-----+

Re: Spark tasks too slow and not doing parallel processing

-werners- — Fri, 31 Mar 2023 10:12:06 GMT

That is pretty skewed. However, it does not explain why there is no parallelism.

The only explanations I can see are that either:

- the merge only hits one partition, or

- you apply a coalesce(1) or repartition(1) somewhere
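
To put a number on the skew, here is a small pure-Python sketch that uses the per-partition counts posted earlier in the thread. The function name `skew_summary` and the choice of metrics are only illustrative.

```python
# Per-partition row counts copied from the output posted earlier in the thread.
partition_counts = {0: 14557, 1: 25455, 2: 20330, 3: 1776,
                    4: 2868, 5: 1251, 6: 1145, 7: 127}


def skew_summary(counts):
    """Summarize how unevenly rows are spread across partitions."""
    sizes = sorted(counts.values(), reverse=True)
    total = sum(sizes)
    mean = total / len(sizes)
    return {
        "total_rows": total,
        "mean_rows": mean,
        # how much larger the biggest partition is than the average one
        "max_over_mean": sizes[0] / mean,
        # share of all rows held by the three largest partitions
        "top3_share": sum(sizes[:3]) / total,
    }


summary = skew_summary(partition_counts)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

With these counts, the largest partition is roughly three times the average, and the three largest partitions hold close to 90% of the rows. To rule out the second explanation, printing `df.rdd.getNumPartitions()` on the source DataFrame right before the merge shows whether it has been collapsed to a single partition.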

Re: Spark tasks too slow and not doing parallel processing

sanjay — Fri, 31 Mar 2023 10:22:01 GMT

There are 8 partitions, and this is the same data I need to merge.

How can I check how many partitions the merge uses? It should use all 8 partitions.

No, I haven't used coalesce or repartition.

Is it possible to connect live? I can show you the code.

Re: Spark tasks too slow and not doing parallel processing

-werners- — Fri, 31 Mar 2023 10:29:29 GMT

In the history of the Delta table you can see how many files have been rewritten (in the operationMetrics column).

There are statistics such as numTargetFilesAdded and numTargetFilesRemoved.

The fact that your source DataFrame (the incoming data) has 8 partitions does not mean that the merge will also update 8 partitions of the Delta table.
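
As a rough sketch, those metrics could be pulled out of the table history like this. It assumes an active `spark` session; the helper names and the table name in the usage note are hypothetical, and the exact set of metric keys can vary between Delta versions, so missing keys are returned as `None`.

```python
def merge_file_metrics(operation_metrics):
    """Pick file- and row-level counters out of a MERGE entry's operationMetrics map.

    Delta stores the metric values as strings, so they are converted to int here.
    Keys that are absent are reported as None rather than guessed.
    """
    keys = [
        "numTargetFilesAdded",
        "numTargetFilesRemoved",
        "numTargetRowsInserted",
        "numTargetRowsUpdated",
    ]
    return {k: int(operation_metrics[k]) if k in operation_metrics else None for k in keys}


def latest_merge_metrics(spark, table_name):
    """Return the metrics of the most recent MERGE on table_name, or None if there is none."""
    history = spark.sql(f"DESCRIBE HISTORY {table_name}")
    rows = (
        history.filter("operation = 'MERGE'")
        .orderBy("version", ascending=False)
        .select("operationMetrics")
        .limit(1)
        .collect()
    )
    return merge_file_metrics(rows[0]["operationMetrics"]) if rows else None
```

Calling `latest_merge_metrics(spark, "my_schema.my_table")` (placeholder table name) after a merge shows how many target files were removed and added. If only one or two files are touched, the write side of the merge has little to parallelize, whatever the number of source partitions.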

Re: Spark tasks too slow and not doing parallel processing

sanjay — Fri, 31 Mar 2023 10:46:30 GMT

The Delta table uses the same columns as the source table and should have 8 partitions.

Re: Spark tasks too slow and not doing parallel processing

-werners- — Fri, 31 Mar 2023 10:50:48 GMT

The number of partitions in the Delta table is not relevant. What is relevant is how many partitions or files are affected by the merge.

That can be displayed in the Delta history.

Databricks can also apply optimizations while writing, so it may decide to write a single file instead of 8. Writing will be slower, but reading will be faster.

Re: Spark tasks too slow and not doing parallel processing

sanjay — Fri, 31 Mar 2023 11:01:04 GMT

Any suggestions to improve the performance? Is there any configuration parameter to optimize this? Is there any documentation on how to debug in the Spark UI?

Re: Spark tasks too slow and not doing parallel processing

-werners- — Fri, 31 Mar 2023 11:10:31 GMT

There are several options:

You can disable optimizations (see the Databricks Delta Lake performance optimization documentation), but I would advise against that.

The Databricks default settings in the most recent runtimes are pretty well optimized, IMO. You can write fast using 80 CPUs (so 80 partitions), but that will hurt performance when reading the data.

Semantic partitioning of the Delta table is certainly a good idea (if not already done). There is also Z-ORDER.

There is no simple answer to this.

Even if your merge ends up running in parallel, you will still have to take the data skew into account.

Debugging is really hard, if not almost impossible, in Spark due to the parallel nature of the application.
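
For the partitioning and Z-ORDER suggestions above, a minimal sketch might look like the following. It only builds the SQL text, so it can be inspected before anything is run. All table and column names are placeholders, and the idea of adding a partition filter to the merge condition is an assumption about what would help here, not something confirmed for this job.

```python
def optimize_zorder_sql(table_name, zorder_cols):
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not zorder_cols:
        raise ValueError("at least one Z-ORDER column is required")
    return f"OPTIMIZE {table_name} ZORDER BY ({', '.join(zorder_cols)})"


def partition_pruned_condition(key_cols, partition_col, partition_values,
                               target="t", source="s"):
    """Build a MERGE ON condition that limits the target to the given partition values.

    Restricting the target partition column lets the merge skip partitions that
    the incoming data cannot match. Values are quoted as plain string literals,
    so this assumes simple values such as dates without embedded quotes.
    """
    values = ", ".join(f"'{v}'" for v in partition_values)
    keys = " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_cols)
    return f"{target}.{partition_col} IN ({values}) AND {keys}"


# Placeholder names, for illustration only.
print(optimize_zorder_sql("my_schema.events", ["customer_id"]))
print(partition_pruned_condition(["id"], "event_date", ["2023-03-30", "2023-03-31"]))
```

The first statement can be passed to `spark.sql(...)` on Databricks. The second string is meant to be used as the `ON` condition of the merge, with the target aliased as `t` and the source as `s`.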

Re: Spark tasks too slow and not doing parallel processing

Anonymous — Sat, 01 Apr 2023 02:12:40 GMT

Hi @Sanjay Jain

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!

Re: Spark tasks too slow and not doing parallel processing

sanjay — Mon, 03 Apr 2023 07:02:45 GMT

Hi Vidula,

I have not been able to find the right solution to this problem. I would appreciate any help you can provide.

Regards,

Sanjay

Re: Spark tasks too slow and not doing parallel processing

plondon — Wed, 24 Jul 2024 11:07:00 GMT

Would it be any different, i.e. faster, when running Spark within Azure?