<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance enhancement while writing dataframes into Parquet tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</link>
    <description>&lt;P&gt;Hi @Souradipta Sen,&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Mon, 13 Feb 2023 07:15:40 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2023-02-13T07:15:40Z</dc:date>
    <item>
      <title>Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9539#M4878</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am trying to write the contents of a dataframe into a Parquet table using the command below.&lt;/P&gt;&lt;P&gt;&lt;I&gt;df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table")&lt;/I&gt;&lt;/P&gt;&lt;P&gt;The dataframe contains an extract from one of our source systems, which happens to be a Postgres database, and was prepared using a SQL statement. The data volume is approximately 0.3M records. The target table is Parquet, and I have tried writing in overwrite mode.&lt;/P&gt;&lt;P&gt;The problem is, this statement keeps running with no progress and automatically times out after hours. As part of our requirement, we can afford a maximum of ~10 minutes to get this written into the target.&lt;/P&gt;&lt;P&gt;Is there a way to improve the performance, or at least understand where the problem lies? The target can be changed to Delta and can also be partitioned if needed.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2023 16:57:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9539#M4878</guid>
      <dc:creator>Sen</dc:creator>
      <dc:date>2023-02-11T16:57:24Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9540#M4879</link>
      <description>&lt;P&gt;I think you can partition the data and store it as a Delta table, then optimize the table using Z-Ordering.&lt;/P&gt;&lt;P&gt;May I know your cluster configuration as well?&lt;/P&gt;</description>
      <pubDate>Sun, 12 Feb 2023 14:21:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9540#M4879</guid>
      <dc:creator>mk1987c</dc:creator>
      <dc:date>2023-02-12T14:21:12Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</link>
      <description>&lt;P&gt;Hi @Souradipta Sen,&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 13 Feb 2023 07:15:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9541#M4880</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-02-13T07:15:40Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9542#M4881</link>
      <description>&lt;P&gt;I would highly recommend saving your data as Delta instead of Parquet; Delta has many extra benefits.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Feb 2023 18:28:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/9542#M4881</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2023-02-23T18:28:33Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63170#M32190</link>
      <description>&lt;P&gt;IMHO, this issue may be caused by the SQL query that generates your DataFrame. Queries are lazy operations and only start when the data is needed, in this case when you write the DataFrame to the table (0.3M rows is nothing for Spark). So it is not the write that causes this issue but the query: rewrite it for performance and everything will run fast.&lt;/P&gt;&lt;P&gt;Have a nice day!&lt;/P&gt;</description>
      <pubDate>Sun, 10 Mar 2024 18:33:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63170#M32190</guid>
      <dc:creator>Kyrylo_Ozz</dc:creator>
      <dc:date>2024-03-10T18:33:34Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63197#M32191</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I agree with the reply about the benefits of Delta tables; specifically, Delta brings additional features such as ACID transactions and schema evolution. However, I am not sure the problem quoted below would be reduced: "The problem is, this statement keeps running with no progress and automatically times out after hours. As part of our requirement, we can afford a maximum of ~10 minutes to get this written into the target." The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, are similar for Parquet and Delta. When you use partitionBy during a write operation in Spark, it involves a shuffle to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because both rely on the Spark engine for data processing. My inclination would be to observe from the Spark UI (port 4040) which insert jobs are taking longest in the Stages tab. Evaluate the task-level metrics, such as input/output data size, shuffle read/write, and CPU time. The SQL and Executors tabs will also help in pinpointing the issue. You can also use compression such as Snappy to reduce the volume of writes:&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier" size="2"&gt;df.write.option("compression", "snappy").mode("overwrite").format("parquet").saveAsTable("table_name")&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
      <pubDate>Sun, 10 Mar 2024 20:31:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63197#M32191</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-10T20:31:14Z</dc:date>
    </item>
    <item>
      <title>Re: Performance enhancement while writing dataframes into Parquet tables</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63377#M32221</link>
      <description>&lt;P&gt;&lt;SPAN&gt;With regard to the point below, which has been accepted as a solution:&lt;BR /&gt;"I would highly recommend saving your data as Delta instead of Parquet; Delta has many extra benefits."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The fundamental considerations for optimizing write operations, especially those involving shuffling and partitioning, are similar for Parquet and Delta. When you use partitionBy during a write operation in Spark, it involves a shuffle to redistribute the data across the specified partitions. This is true for both Parquet and Delta tables because both rely on the Spark engine for data processing. My inclination would be to observe from the Spark UI (port 4040) which insert jobs are taking longest in the Stages tab. Evaluate the task-level metrics, such as input/output data size, shuffle read/write, and CPU time. The SQL and Executors tabs will also help in pinpointing the issue. You can also use compression such as Snappy to reduce the volume of writes.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;HTH&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 12:28:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-enhancement-while-writing-dataframes-into-parquet/m-p/63377#M32221</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-12T12:28:32Z</dc:date>
    </item>
  </channel>
</rss>

