<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Understanding Delta Table Partition Size Distribution Using the Delta Log in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/understanding-delta-table-partition-size-distribution-using-the/m-p/148497#M1013</link>
    <description>&lt;P&gt;When working with &lt;STRONG&gt;externally managed Delta tables&lt;/STRONG&gt; and traditional partitioning strategies (for example by day, week, or month), one common challenge is:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;EM&gt;How large are my partitions actually?&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Before deciding whether to partition by &lt;STRONG&gt;day vs. week vs. month&lt;/STRONG&gt;, it’s important to understand how data is physically distributed across partitions.&lt;/P&gt;&lt;P&gt;This article shows how to extract &lt;STRONG&gt;partition-level size statistics directly from the Delta transaction log&lt;/STRONG&gt;.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":warning:"&gt;⚠️&lt;/span&gt;This approach is useful when using traditional partitioning and external table strategies.&lt;BR /&gt;If you’re using &lt;STRONG&gt;Unity Catalog managed tables&lt;/STRONG&gt; or &lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt;, partition sizing decisions are handled differently.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;H2&gt;Why Look at the Delta Log?&lt;/H2&gt;&lt;P&gt;Every Delta table maintains a _delta_log directory containing JSON transaction files.&lt;BR /&gt;Each add action inside the log includes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;File path&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;File size (in bytes)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Partition values&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;By reading these logs, we can compute:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Number of files per partition&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Total bytes per partition&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Size distribution across partitions&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This gives direct visibility into the &lt;STRONG&gt;physical layout&lt;/STRONG&gt; of your table.&lt;/P&gt;&lt;H2&gt;Example: Compute Partition Sizes by startDate&lt;/H2&gt;&lt;PRE&gt;from pyspark.sql import functions as F

delta_log = spark.read.json(
    "abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)

files = (
    delta_log
    .filter("add is not null")
    .select(
        F.col("add.path").alias("path"),
        F.col("add.size").alias("size"),
        F.col("add.partitionValues.startDate").alias("startDate")
    )
)

(
    files.groupBy("startDate")
    .agg(
        F.count("*").alias("numFiles"),
        F.sum("size").alias("totalBytes")
    )
    .withColumn("sizeGB", F.col("totalBytes") / (1024**3))
    .orderBy("startDate", ascending=False)
    .show(20, False)
)&lt;/PRE&gt;&lt;H3&gt;Sample Output&lt;/H3&gt;&lt;PRE&gt;+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB  |
+----------+--------+-----------+--------+
|2026-02-16|  xxx   |   xxx     |   xx   |
|2026-02-15|  xxx   |   xxx     |   xx   |
+----------+--------+-----------+--------+&lt;/PRE&gt;&lt;H2&gt;How This Helps With Partition Strategy&lt;/H2&gt;&lt;P&gt;Once you have partition size metrics, you can evaluate:&lt;/P&gt;&lt;H3&gt;Are partitions too small?&lt;/H3&gt;&lt;P&gt;If daily partitions are only a few MB:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;You may be over-partitioning.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Consider partitioning by week or month instead.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Are partitions too large?&lt;/H3&gt;&lt;P&gt;If partitions exceed hundreds of GB:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Queries may scan too much data.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Consider a finer-grained partitioning strategy.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Are files unevenly distributed?&lt;/H3&gt;&lt;P&gt;High numFiles with small average size indicates small file issues.&lt;/P&gt;&lt;H2&gt;Decision Guidelines&lt;/H2&gt;&lt;P&gt;Partition Size (per partition)Recommendation&lt;/P&gt;&lt;P&gt;&amp;lt; 1 GB Likely over-partitioned&lt;/P&gt;&lt;P&gt;1–20 GB Usually healthy&lt;/P&gt;&lt;P&gt;50+ GBConsider finer partitioning&lt;/P&gt;&lt;P&gt;100+ GB May impact performance&lt;/P&gt;&lt;P&gt;&lt;EM&gt;(Adjust based on workload and query patterns.)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;(Adjust based on workload and query patterns.)&lt;/EM&gt;&lt;/P&gt;&lt;H2&gt;Bonus: Average File Size Per Partition&lt;/H2&gt;&lt;P&gt;You can extend the analysis:&lt;/P&gt;&lt;PRE&gt;.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))&lt;/PRE&gt;&lt;P&gt;This helps detect small file problems inside partitions.&lt;/P&gt;&lt;H2&gt;When Should You Use This?&lt;/H2&gt;&lt;P&gt;This approach is particularly useful when:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Using &lt;STRONG&gt;external Delta tables&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Managing your own storage layout&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Designing a new partition strategy&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Migrating legacy Hive-style tables&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Troubleshooting performance issues&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It gives a &lt;STRONG&gt;low-level, transparent view&lt;/STRONG&gt; of how data is physically stored.&lt;/P&gt;&lt;H2&gt;Final Thoughts&lt;/H2&gt;&lt;P&gt;Partitioning decisions should be based on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Query access patterns&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Partition cardinality&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Physical partition size&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;File size distribution&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Reading the Delta transaction log provides a simple yet powerful way to understand your table layout — before committing to a partitioning strategy.&lt;/P&gt;</description>
    <pubDate>Mon, 16 Feb 2026 10:54:23 GMT</pubDate>
    <dc:creator>Kirankumarbs</dc:creator>
    <dc:date>2026-02-16T10:54:23Z</dc:date>
    <item>
      <title>Understanding Delta Table Partition Size Distribution Using the Delta Log</title>
      <link>https://community.databricks.com/t5/community-articles/understanding-delta-table-partition-size-distribution-using-the/m-p/148497#M1013</link>
      <description>&lt;P&gt;When working with &lt;STRONG&gt;externally managed Delta tables&lt;/STRONG&gt; and traditional partitioning strategies (for example by day, week, or month), one common challenge is:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;EM&gt;How large are my partitions actually?&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Before deciding whether to partition by &lt;STRONG&gt;day vs. week vs. month&lt;/STRONG&gt;, it’s important to understand how data is physically distributed across partitions.&lt;/P&gt;&lt;P&gt;This article shows how to extract &lt;STRONG&gt;partition-level size statistics directly from the Delta transaction log&lt;/STRONG&gt;.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":warning:"&gt;⚠️&lt;/span&gt;This approach is useful when using traditional partitioning and external table strategies.&lt;BR /&gt;If you’re using &lt;STRONG&gt;Unity Catalog managed tables&lt;/STRONG&gt; or &lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt;, partition sizing decisions are handled differently.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;H2&gt;Why Look at the Delta Log?&lt;/H2&gt;&lt;P&gt;Every Delta table maintains a _delta_log directory containing JSON transaction files.&lt;BR /&gt;Each add action inside the log includes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;File path&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;File size (in bytes)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Partition values&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;By reading these logs, we can compute:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Number of files per partition&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Total bytes per partition&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Size distribution across partitions&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This gives direct visibility into the &lt;STRONG&gt;physical layout&lt;/STRONG&gt; of your table.&lt;/P&gt;&lt;H2&gt;Example: Compute Partition Sizes by startDate&lt;/H2&gt;&lt;PRE&gt;from pyspark.sql import functions as F

delta_log = spark.read.json(
    "abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)

files = (
    delta_log
    .filter("add is not null")
    .select(
        F.col("add.path").alias("path"),
        F.col("add.size").alias("size"),
        F.col("add.partitionValues.startDate").alias("startDate")
    )
)

(
    files.groupBy("startDate")
    .agg(
        F.count("*").alias("numFiles"),
        F.sum("size").alias("totalBytes")
    )
    .withColumn("sizeGB", F.col("totalBytes") / (1024**3))
    .orderBy("startDate", ascending=False)
    .show(20, False)
)&lt;/PRE&gt;&lt;H3&gt;Sample Output&lt;/H3&gt;&lt;PRE&gt;+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB  |
+----------+--------+-----------+--------+
|2026-02-16|  xxx   |   xxx     |   xx   |
|2026-02-15|  xxx   |   xxx     |   xx   |
+----------+--------+-----------+--------+&lt;/PRE&gt;&lt;H2&gt;How This Helps With Partition Strategy&lt;/H2&gt;&lt;P&gt;Once you have partition size metrics, you can evaluate:&lt;/P&gt;&lt;H3&gt;Are partitions too small?&lt;/H3&gt;&lt;P&gt;If daily partitions are only a few MB:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;You may be over-partitioning.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Consider partitioning by week or month instead.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Are partitions too large?&lt;/H3&gt;&lt;P&gt;If partitions exceed hundreds of GB:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Queries may scan too much data.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Consider a finer-grained partitioning strategy.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Are files unevenly distributed?&lt;/H3&gt;&lt;P&gt;High numFiles with small average size indicates small file issues.&lt;/P&gt;&lt;H2&gt;Decision Guidelines&lt;/H2&gt;&lt;P&gt;Partition Size (per partition)Recommendation&lt;/P&gt;&lt;P&gt;&amp;lt; 1 GB Likely over-partitioned&lt;/P&gt;&lt;P&gt;1–20 GB Usually healthy&lt;/P&gt;&lt;P&gt;50+ GBConsider finer partitioning&lt;/P&gt;&lt;P&gt;100+ GB May impact performance&lt;/P&gt;&lt;P&gt;&lt;EM&gt;(Adjust based on workload and query patterns.)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;(Adjust based on workload and query patterns.)&lt;/EM&gt;&lt;/P&gt;&lt;H2&gt;Bonus: Average File Size Per Partition&lt;/H2&gt;&lt;P&gt;You can extend the analysis:&lt;/P&gt;&lt;PRE&gt;.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))&lt;/PRE&gt;&lt;P&gt;This helps detect small file problems inside partitions.&lt;/P&gt;&lt;H2&gt;When Should You Use This?&lt;/H2&gt;&lt;P&gt;This approach is particularly useful when:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Using &lt;STRONG&gt;external Delta tables&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Managing your own storage layout&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Designing a new partition strategy&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Migrating legacy Hive-style tables&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Troubleshooting performance issues&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It gives a &lt;STRONG&gt;low-level, transparent view&lt;/STRONG&gt; of how data is physically stored.&lt;/P&gt;&lt;H2&gt;Final Thoughts&lt;/H2&gt;&lt;P&gt;Partitioning decisions should be based on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Query access patterns&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Partition cardinality&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Physical partition size&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;File size distribution&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Reading the Delta transaction log provides a simple yet powerful way to understand your table layout — before committing to a partitioning strategy.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Feb 2026 10:54:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/understanding-delta-table-partition-size-distribution-using-the/m-p/148497#M1013</guid>
      <dc:creator>Kirankumarbs</dc:creator>
      <dc:date>2026-02-16T10:54:23Z</dc:date>
    </item>
  </channel>
</rss>

