<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Writing to Delta tables/files is taking a long time in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-tables-files-is-taking-a-long-time/m-p/46124#M28007</link>
    <description>&lt;P&gt;I have a DataFrame that is the result of a series of transformations on a large dataset (167 million rows), and I want to write it to Delta files and a Delta table using the code below:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;try:
    # Write the DataFrame to the Delta path, replacing existing data and schema
    (df_new.write.format('delta')
     .option("delta.minReaderVersion", "2")
     .option("delta.minWriterVersion", "5")
     .option("spark.databricks.delta.optimizeWrite.enabled",True)
     .option("delta.columnMapping.mode", "name")
     .mode('overwrite')
     .option("overwriteSchema", True)
     .save(f'/mnt/mymountpoint/Gold_tables/tasoapplans'))
    try:
        # Then overwrite the registered table with the same DataFrame
        df_new.write.insertInto('Gold_tables.tasoapplans', overwrite=True)
    except:
        # If that fails (for example, the table is not registered yet), create it at the Delta path
        spark.sql("create table IF NOT EXISTS Gold_tables.tasoapplans using delta location '/mnt/mymountpoint/Gold_tables/tasoapplans'")
except Exception as e:
    dbutils.notebook.exit(str(e))&lt;/LI-CODE&gt;&lt;P&gt;But the write is taking too long (the query takes 1 hour, the write takes 1 hour 30 minutes).&lt;BR /&gt;The cluster used is:&lt;BR /&gt;Memory-optimized cluster, Standard_DS12_v2 (28 GB memory, 4 cores)&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Use Photon Acceleration&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Min workers: 2&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Max workers: 8&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;How can I improve the write performance?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 25 Sep 2023 11:37:56 GMT</pubDate>
    <dc:creator>wissamimad</dc:creator>
    <dc:date>2023-09-25T11:37:56Z</dc:date>
    <item>
      <title>Writing to Delta tables/files is taking a long time</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-tables-files-is-taking-a-long-time/m-p/46124#M28007</link>
      <description>&lt;P&gt;I have a DataFrame that is the result of a series of transformations on a large dataset (167 million rows), and I want to write it to Delta files and a Delta table using the code below:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;try:
    # Write the DataFrame to the Delta path, replacing existing data and schema
    (df_new.write.format('delta')
     .option("delta.minReaderVersion", "2")
     .option("delta.minWriterVersion", "5")
     .option("spark.databricks.delta.optimizeWrite.enabled",True)
     .option("delta.columnMapping.mode", "name")
     .mode('overwrite')
     .option("overwriteSchema", True)
     .save(f'/mnt/mymountpoint/Gold_tables/tasoapplans'))
    try:
        # Then overwrite the registered table with the same DataFrame
        df_new.write.insertInto('Gold_tables.tasoapplans', overwrite=True)
    except:
        # If that fails (for example, the table is not registered yet), create it at the Delta path
        spark.sql("create table IF NOT EXISTS Gold_tables.tasoapplans using delta location '/mnt/mymountpoint/Gold_tables/tasoapplans'")
except Exception as e:
    dbutils.notebook.exit(str(e))&lt;/LI-CODE&gt;&lt;P&gt;But the write is taking too long (the query takes 1 hour, the write takes 1 hour 30 minutes).&lt;BR /&gt;The cluster used is:&lt;BR /&gt;Memory-optimized cluster, Standard_DS12_v2 (28 GB memory, 4 cores)&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Use Photon Acceleration&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Min workers: 2&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Max workers: 8&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;How can I improve the write performance?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 25 Sep 2023 11:37:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-to-delta-tables-files-is-taking-a-long-time/m-p/46124#M28007</guid>
      <dc:creator>wissamimad</dc:creator>
      <dc:date>2023-09-25T11:37:56Z</dc:date>
    </item>
    <item>
      <title>Re: Writing to Delta tables/files is taking a long time</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-tables-files-is-taking-a-long-time/m-p/56393#M30539</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, I am having the same issue. I do an inner join on two Spark DataFrames, and the job appears to run on only a single node; I am not sure how to change it so that it runs on multiple nodes. Likewise, when I write 30 GB of data to a Delta table, the write is still running after almost 3 hours. How can we reduce this time?&lt;/P&gt;&lt;P&gt;It is a simple join of two tables: the first table has 50 million records and the second has 300k records. The inner join took 20 minutes, and I want to save the result as a new Delta table.&lt;/P&gt;&lt;P&gt;Here is the code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;result_df = Invoice_Data.join(Fixed_df, on=['Code', 'item_no', 'supplier_no'], how='inner')

result_df.write.option("overwriteSchema", "true").format("delta").mode("overwrite").save("abfss://data@abc.dfs.core.windows.net/features/MCA")&lt;/LI-CODE&gt;&lt;P&gt;I have attached the timing metrics. Please let me know how we can optimize this.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jan 2024 23:12:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-to-delta-tables-files-is-taking-a-long-time/m-p/56393#M30539</guid>
      <dc:creator>prasu1222</dc:creator>
      <dc:date>2024-01-03T23:12:57Z</dc:date>
    </item>
  </channel>
</rss>

