<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark tasks too slow and not doing parellel processing in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6817#M2826</link>
    <description>&lt;P&gt;Hi @Sanjay Jain, did you get a chance to check how many partitions your DataFrame has before the merge operation, and how the data is distributed among them? This will show whether your data is skewed. You might also need to look at the key you are merging on, to check for skew on any specific set of values.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The code below returns the record count per partition:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import spark_partition_id
# Tag each row with the ID of the partition it lives in, then count rows per partition.
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Thu, 30 Mar 2023 13:47:00 GMT</pubDate>
    <dc:creator>pvignesh92</dc:creator>
    <dc:date>2023-03-30T13:47:00Z</dc:date>
    <item>
      <title>Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6815#M2824</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a Spark job that processes a large data set, and it is taking too long. In the Spark UI, I can see it is running only 1 task out of 9. I am not sure how to run this in parallel. I have already enabled autoscaling with up to 8 instances.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A screenshot of the Spark UI is attached.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please suggest how to debug this and fix the performance issue.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2023 07:42:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6815#M2824</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-03-30T07:42:50Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6816#M2825</link>
      <description>&lt;P&gt;From the screenshot you provided, it seems you are running a merge statement.&lt;/P&gt;&lt;P&gt;Depending on how your Delta table is partitioned, this may or may not run in parallel.&lt;/P&gt;&lt;P&gt;For example, if all your incoming data resides in one huge partition, Spark will have to completely rewrite this huge partition, which can take a long time.&lt;/P&gt;&lt;P&gt;Can you share some code?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2023 08:26:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6816#M2825</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-03-30T08:26:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6817#M2826</link>
      <description>&lt;P&gt;Hi @Sanjay Jain, did you get a chance to check how many partitions your DataFrame has before the merge operation, and how the data is distributed among them? This will show whether your data is skewed. You might also need to look at the key you are merging on, to check for skew on any specific set of values.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The code below returns the record count per partition:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import spark_partition_id
# Tag each row with the ID of the partition it lives in, then count rows per partition.
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 30 Mar 2023 13:47:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6817#M2826</guid>
      <dc:creator>pvignesh92</dc:creator>
      <dc:date>2023-03-30T13:47:00Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6818#M2827</link>
      <description>&lt;P&gt;My partitioning is based on date. Here is the partition information for around 70k records.&lt;/P&gt;&lt;PRE&gt;+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|14557|
|          1|25455|
|          2|20330|
|          3| 1776|
|          4| 2868|
|          5| 1251|
|          6| 1145|
|          7|  127|
+-----------+-----+&lt;/PRE&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:07:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6818#M2827</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-03-31T10:07:04Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6819#M2828</link>
      <description>&lt;P&gt;That is pretty skewed. However, that does not explain why there is no parallelism.&lt;/P&gt;&lt;P&gt;The only reasons I can see are that either:&lt;/P&gt;&lt;P&gt;- the merge only hits one partition, or&lt;/P&gt;&lt;P&gt;- you apply a coalesce(1) or repartition(1) somewhere&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:12:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6819#M2828</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-03-31T10:12:06Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6820#M2829</link>
      <description>&lt;P&gt;There are 8 partitions, and this is the same data I need to merge.&lt;/P&gt;&lt;P&gt;How can I check how many partitions are used by the merge? It should use all 8 partitions.&lt;/P&gt;&lt;P&gt;No, I haven't used coalesce or repartition.&lt;/P&gt;&lt;P&gt;Is it possible to connect live? I can show you the code.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:22:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6820#M2829</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-03-31T10:22:01Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6821#M2830</link>
      <description>&lt;P&gt;In the history of the Delta table you can see how many files have been rewritten (in the operationMetrics column).&lt;/P&gt;&lt;P&gt;There are statistics such as numTargetFilesAdded and numTargetFilesRemoved.&lt;/P&gt;&lt;P&gt;The fact that your source DataFrame (the incoming data) has 8 partitions does not mean that the Delta Lake table will also update 8 partitions.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:29:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6821#M2830</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-03-31T10:29:29Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6822#M2831</link>
      <description>&lt;P&gt;The Delta table uses the same columns as the source table and should have 8 partitions.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:46:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6822#M2831</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-03-31T10:46:30Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6823#M2832</link>
      <description>&lt;P&gt;The number of partitions in the Delta table is not relevant. What is relevant is how many partitions or files are affected by the merge.&lt;/P&gt;&lt;P&gt;That can be displayed in the Delta history.&lt;/P&gt;&lt;P&gt;Databricks can also apply optimizations while writing, so it is possible that it decides to write a single file instead of 8. Writing will be slower, but reading will be faster.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 10:50:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6823#M2832</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-03-31T10:50:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6824#M2833</link>
      <description>&lt;P&gt;Any suggestions to improve the performance? Is there any configuration parameter to optimize this? Is there any documentation on how to debug in the Spark UI?&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 11:01:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6824#M2833</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-03-31T11:01:04Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6825#M2834</link>
      <description>&lt;P&gt;There are several methods:&lt;/P&gt;&lt;P&gt;You can disable optimizations (see the Databricks Delta Lake performance optimization help files), but I would advise against that.&lt;/P&gt;&lt;P&gt;The Databricks default settings in the most recent runtimes are pretty well optimized, in my opinion. You can write fast using 80 CPUs (so 80 partitions), but that will have a negative performance impact when reading this data.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Semantic partitioning of the Delta table is certainly a good idea (if not already done). And there is also Z-ORDER.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;There is no simple answer to this.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Even if your merge does end up running in parallel, you also have to take the data skew into account.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Debugging is really hard, if not almost impossible, in Spark due to the parallel nature of the application.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Mar 2023 11:10:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6825#M2834</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-03-31T11:10:31Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6826#M2835</link>
      <description>&lt;P&gt;Hi @Sanjay Jain,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope everything is going great.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Cheers!&lt;/P&gt;</description>
      <pubDate>Sat, 01 Apr 2023 02:12:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6826#M2835</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-04-01T02:12:40Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6827#M2836</link>
      <description>&lt;P&gt;Hi Vidula,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have not been able to find the right solution to this problem. I would appreciate any help you can provide.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Sanjay&lt;/P&gt;</description>
      <pubDate>Mon, 03 Apr 2023 07:02:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/6827#M2836</guid>
      <dc:creator>sanjay</dc:creator>
      <dc:date>2023-04-03T07:02:45Z</dc:date>
    </item>
    <item>
      <title>Re: Spark tasks too slow and not doing parellel processing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/80324#M35990</link>
      <description>&lt;P&gt;Would it be any different, i.e. faster, if using Spark within Azure?&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 11:07:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-tasks-too-slow-and-not-doing-parellel-processing/m-p/80324#M35990</guid>
      <dc:creator>plondon</dc:creator>
      <dc:date>2024-07-24T11:07:00Z</dc:date>
    </item>
  </channel>
</rss>

