<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Does too many parquet files in delta table impact writes for the streaming job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126040#M47622</link>
    <description>&lt;P&gt;Understood.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Please check the documentation on this topic:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/delta/best-practices" target="_blank"&gt;https://docs.databricks.com/aws/en/delta/best-practices&lt;/A&gt;&lt;BR /&gt;&lt;SPAN&gt;Databricks recommends frequently running the&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/delta/optimize" target="_blank"&gt;OPTIMIZE&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;command to compact small files.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Please try running this:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;-- For a non-partitioned table&lt;BR /&gt;OPTIMIZE my_delta_table;&lt;/P&gt;&lt;P&gt;-- For a table partitioned by 'date', optimizing the last 2 days&lt;BR /&gt;OPTIMIZE my_delta_table WHERE date &amp;gt;= current_date() - INTERVAL 2 DAY;&lt;/P&gt;&lt;P&gt;I am waiting for others in the forum to confirm this solution.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp; I am sure this is not a production environment, right?&lt;/P&gt;</description>
    <pubDate>Tue, 22 Jul 2025 18:36:57 GMT</pubDate>
    <dc:creator>Khaja_Zaffer</dc:creator>
    <dc:date>2025-07-22T18:36:57Z</dc:date>
    <item>
      <title>Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125978#M47597</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running a Spark streaming job that reads data from AWS Kinesis and writes it to external Delta tables stored in S3. I have noticed that the latency has been increasing over time, and that the addBatch and commitBatch times for each batch have been increasing as well. I am writing to the table in append mode.&lt;/P&gt;&lt;P&gt;I did run an OPTIMIZE, and my latency improved along with reductions in&amp;nbsp;addBatch and commitBatch duration.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I know that too many small files reduce read performance, but my question is: do too many small streaming Parquet files in a Delta table impact writes for the streaming job? But&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 09:53:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125978#M47597</guid>
      <dc:creator>VaderKB</dc:creator>
      <dc:date>2025-07-22T09:53:18Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125985#M47600</link>
      <description>&lt;P&gt;But? Please share your whole query.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 11:29:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125985#M47600</guid>
      <dc:creator>Khaja_Zaffer</dc:creator>
      <dc:date>2025-07-22T11:29:59Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125992#M47605</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/173840"&gt;@Khaja_Zaffer&lt;/a&gt;&amp;nbsp;Sorry, the &lt;STRONG&gt;"But"&amp;nbsp;&lt;/STRONG&gt;is a typo. The query is simple: I read using readStream, extract the data, expand the JSON into a table structure, and write the data back using writeStream in append mode, with a checkpoint location and mergeSchema set to true. It is something like:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;(data
.writeStream
.outputMode("append")
.format("delta")
.queryName("query_name")
.option("checkpointLocation", checkpoint_location)
.option("header", "true")
.option("mergeSchema", "true")
.toTable(table_name)
)&lt;/LI-CODE&gt;&lt;P&gt;I cannot share the complete query, but it is quite straightforward. And it becomes slower and slower as the table size grows.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 12:15:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/125992#M47605</guid>
      <dc:creator>VaderKB</dc:creator>
      <dc:date>2025-07-22T12:15:42Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126040#M47622</link>
      <description>&lt;P&gt;Understood.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Please check the documentation on this topic:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/delta/best-practices" target="_blank"&gt;https://docs.databricks.com/aws/en/delta/best-practices&lt;/A&gt;&lt;BR /&gt;&lt;SPAN&gt;Databricks recommends frequently running the&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/delta/optimize" target="_blank"&gt;OPTIMIZE&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;command to compact small files.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Please try running this:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;-- For a non-partitioned table&lt;BR /&gt;OPTIMIZE my_delta_table;&lt;/P&gt;&lt;P&gt;-- For a table partitioned by 'date', optimizing the last 2 days&lt;BR /&gt;OPTIMIZE my_delta_table WHERE date &amp;gt;= current_date() - INTERVAL 2 DAY;&lt;/P&gt;&lt;P&gt;I am waiting for others in the forum to confirm this solution.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp; I am sure this is not a production environment, right?&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 18:36:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126040#M47622</guid>
      <dc:creator>Khaja_Zaffer</dc:creator>
      <dc:date>2025-07-22T18:36:57Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126648#M47728</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Did it resolve the issue?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Jul 2025 08:33:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126648#M47728</guid>
      <dc:creator>Khaja_Zaffer</dc:creator>
      <dc:date>2025-07-28T08:33:42Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126856#M47784</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/173840"&gt;@Khaja_Zaffer&lt;/a&gt;, thank you for your answer. Yes, this is a production environment. Running&amp;nbsp;&lt;STRONG&gt;OPTIMIZE&amp;nbsp;&lt;/STRONG&gt;did reduce the latency, but I don't understand why. From what I understand,&amp;nbsp;&lt;STRONG&gt;OPTIMIZE&amp;nbsp;&lt;/STRONG&gt;compacts small files into larger ones. And from what I read in the documentation, that should make reads faster because there are fewer files to read, but it says nothing about writes. For writes in&amp;nbsp;&lt;STRONG&gt;append&amp;nbsp;&lt;/STRONG&gt;mode, I would not expect it to matter, as new files are just added on top of the existing ones. So it should not matter how many files already exist, because we are just adding more. I am not sure how it is implemented under the hood, but if you know why, please do let me know.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Jul 2025 19:18:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126856#M47784</guid>
      <dc:creator>VaderKB</dc:creator>
      <dc:date>2025-07-29T19:18:19Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126878#M47794</link>
      <description>&lt;P&gt;Yes, too many small Parquet files in a Delta table can degrade write performance by increasing metadata overhead during commits. Regularly running OPTIMIZE helps reduce this impact and improve streaming latency.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Jul 2025 04:17:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126878#M47794</guid>
      <dc:creator>jerrygen78</dc:creator>
      <dc:date>2025-07-30T04:17:07Z</dc:date>
    </item>
    <item>
      <title>Re: Does too many parquet files in delta table impact writes for the streaming job</title>
      <link>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126908#M47800</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176082"&gt;@VaderKB&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You're right that OPTIMIZE makes reads faster by reducing the number of files. For writes using append mode, it doesn't directly speed up the operation itself. However, having fewer, larger files from a previous OPTIMIZE run can improve the overall performance of subsequent reads that might be part of a larger job, which could make the entire pipeline seem faster.&lt;BR /&gt;&lt;BR /&gt;You can read more in this document:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank" rel="noopener"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;As the issue is resolved, can you please close the case by selecting a solution on this page?&lt;/P&gt;&lt;P&gt;Thank you.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 30 Jul 2025 08:23:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-too-many-parquet-files-in-delta-table-impact-writes-for-the/m-p/126908#M47800</guid>
      <dc:creator>Khaja_Zaffer</dc:creator>
      <dc:date>2025-07-30T08:23:27Z</dc:date>
    </item>
  </channel>
</rss>

