<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance improvement of Databricks Spark Job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</link>
    <description>&lt;P&gt;Hi,&lt;BR /&gt;I need to improve the performance of a Databricks job in my project. These are the steps the job performs:&lt;BR /&gt;1. Read small CSV/JSON files (100 MB, 50 MB) from multiple locations in S3&lt;BR /&gt;2. Write the data to the bronze layer in Delta/Parquet format&lt;BR /&gt;3. Read from the bronze layer&lt;BR /&gt;4. Apply filters for data cleaning&lt;BR /&gt;5. Write to the silver layer in Delta/Parquet format&lt;BR /&gt;6. Read from the silver layer&lt;BR /&gt;7. Perform many joins and other transformations such as union and distinct&lt;BR /&gt;8. Write the final data to AWS RDS&lt;BR /&gt;&lt;BR /&gt;I'm not seeing enough performance improvement: for 5 KB of data it takes almost 1 min 30 sec. I also observed that there isn't enough parallelism, and not all cores are being utilized (I have 4 cores).&lt;BR /&gt;&lt;BR /&gt;Please give some suggestions on this.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Aug 2024 05:35:53 GMT</pubDate>
    <dc:creator>pinaki1</dc:creator>
    <dc:date>2024-08-23T05:35:53Z</dc:date>
    <item>
      <title>Performance improvement of Databricks Spark Job</title>
      <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;I need to improve the performance of a Databricks job in my project. These are the steps the job performs:&lt;BR /&gt;1. Read small CSV/JSON files (100 MB, 50 MB) from multiple locations in S3&lt;BR /&gt;2. Write the data to the bronze layer in Delta/Parquet format&lt;BR /&gt;3. Read from the bronze layer&lt;BR /&gt;4. Apply filters for data cleaning&lt;BR /&gt;5. Write to the silver layer in Delta/Parquet format&lt;BR /&gt;6. Read from the silver layer&lt;BR /&gt;7. Perform many joins and other transformations such as union and distinct&lt;BR /&gt;8. Write the final data to AWS RDS&lt;BR /&gt;&lt;BR /&gt;I'm not seeing enough performance improvement: for 5 KB of data it takes almost 1 min 30 sec. I also observed that there isn't enough parallelism, and not all cores are being utilized (I have 4 cores).&lt;BR /&gt;&lt;BR /&gt;Please give some suggestions on this.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 05:35:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83988#M37092</guid>
      <dc:creator>pinaki1</dc:creator>
      <dc:date>2024-08-23T05:35:53Z</dc:date>
    </item>
    <item>
      <title>Re: Performance improvement of Databricks Spark Job</title>
      <link>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83993#M37094</link>
      <description>&lt;P&gt;In case of performance issues, always look for 'expensive' operations: mainly wide transformations (shuffles) and collecting data to the driver.&lt;BR /&gt;Start by checking how long the bronze part takes, then silver, and so on.&lt;BR /&gt;Pinpoint where it starts to get slow, then dig into the query plan.&lt;BR /&gt;Chances are that some join is slowing things down.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2024 06:30:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/peformnace-improvement-of-databricks-spark-job/m-p/83993#M37094</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-08-23T06:30:35Z</dc:date>
    </item>
  </channel>
</rss>

