
Understanding Delta Table Partition Size Distribution Using the Delta Log

Kirankumarbs
Contributor III

When working with externally managed Delta tables and traditional partitioning strategies (for example by day, week, or month), one common challenge is:

How large are my partitions actually?

Before deciding whether to partition by day vs. week vs. month, it's important to understand how data is physically distributed across partitions.

This article shows how to extract partition-level size statistics directly from the Delta transaction log.

โš ๏ธThis approach is useful when using traditional partitioning and external table strategies.
If youโ€™re using Unity Catalog managed tables or Liquid Clustering, partition sizing decisions are handled differently.

Why Look at the Delta Log?

Every Delta table maintains a _delta_log directory containing JSON transaction files.
Each add action inside the log includes:

  • File path

  • File size (in bytes)

  • Partition values

By reading these logs, we can compute:

  • Number of files per partition

  • Total bytes per partition

  • Size distribution across partitions

This gives direct visibility into the physical layout of your table.
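To make that structure concrete, here is a minimal sketch of a single add action parsed with plain Python. The file path, size, and date in this line are invented for illustration; real log lines carry additional fields.

```python
import json

# One illustrative "add" action line, as it would appear in a
# _delta_log JSON commit file (values here are made up).
line = (
    '{"add": {"path": "startDate=2026-02-16/part-00000-abc.snappy.parquet", '
    '"partitionValues": {"startDate": "2026-02-16"}, '
    '"size": 104857600, "dataChange": true}}'
)

add = json.loads(line)["add"]
print(add["partitionValues"]["startDate"], add["size"])
# → 2026-02-16 104857600
```

Each such line contributes one (partition value, file size) pair, which is exactly what the aggregation below groups and sums.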

Example: Compute Partition Sizes by startDate

from pyspark.sql import functions as F

# Each JSON commit file holds one action per line (add, remove,
# metaData, ...); fields for other action types come back as null.
# Note: commits compacted into a Parquet checkpoint may have had their
# JSON files cleaned up, so this reflects the retained part of the log.
delta_log = spark.read.json(
    "abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)

# Files added to the table, with their size and partition value.
adds = (
    delta_log
    .filter("add is not null")
    .select(
        F.col("add.path").alias("path"),
        F.col("add.size").alias("size"),
        F.col("add.partitionValues.startDate").alias("startDate")
    )
    .dropDuplicates(["path"])
)

# Files later removed (OPTIMIZE, overwrites, deletes). Excluding them
# avoids counting data that is no longer live in the table.
removes = (
    delta_log
    .filter("remove is not null")
    .select(F.col("remove.path").alias("path"))
)

files = adds.join(removes, on="path", how="left_anti")

(
    files.groupBy("startDate")
    .agg(
        F.count("*").alias("numFiles"),
        F.sum("size").alias("totalBytes")
    )
    .withColumn("sizeGB", F.col("totalBytes") / (1024**3))
    .orderBy("startDate", ascending=False)
    .show(20, False)
)

Sample Output

+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB  |
+----------+--------+-----------+--------+
|2026-02-16|  xxx   |   xxx     |   xx   |
|2026-02-15|  xxx   |   xxx     |   xx   |
+----------+--------+-----------+--------+

How This Helps With Partition Strategy

Once you have partition size metrics, you can evaluate:

Are partitions too small?

If daily partitions are only a few MB:

  • You may be over-partitioning.

  • Consider partitioning by week or month instead.

Are partitions too large?

If partitions exceed hundreds of GB:

  • Queries may scan too much data.

  • Consider a finer-grained partitioning strategy.

Are files unevenly distributed?

A high numFiles count combined with a small average file size indicates a small-file problem.

Decision Guidelines

| Partition Size (per partition) | Recommendation               |
|--------------------------------|------------------------------|
| < 1 GB                         | Likely over-partitioned      |
| 1–20 GB                        | Usually healthy              |
| 50+ GB                         | Consider finer partitioning  |
| 100+ GB                        | May impact performance       |

(Adjust based on workload and query patterns.)
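These guidelines can be captured in a small helper. The thresholds below simply mirror the table; the 20–50 GB range is a gray zone the table leaves to judgment, and all cutoffs are rules of thumb rather than hard limits.

```python
def partition_recommendation(size_gb: float) -> str:
    # Thresholds mirror the guideline table above; adjust for your
    # own workload and query patterns.
    if size_gb < 1:
        return "likely over-partitioned"
    if size_gb <= 20:
        return "usually healthy"
    if size_gb < 100:
        return "consider finer partitioning"
    return "may impact performance"

print(partition_recommendation(0.2))  # → likely over-partitioned
print(partition_recommendation(8))    # → usually healthy
print(partition_recommendation(250))  # → may impact performance
```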


Bonus: Average File Size Per Partition

You can extend the analysis:

.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))

This helps detect small file problems inside partitions.
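As a sketch of how that column can be used, the loop below flags partitions whose average file size falls under an assumed 32 MB threshold. The stats dictionary and the threshold are hypothetical; in practice the numbers would come from the aggregation above.

```python
# Hypothetical per-partition stats: partition -> (numFiles, totalBytes)
stats = {
    "2026-02-16": (4000, 8 * 1024**3),  # many tiny files
    "2026-02-15": (64, 8 * 1024**3),    # fewer, larger files
}

SMALL_FILE_MB = 32  # assumed threshold; tune for your workload

for partition, (num_files, total_bytes) in stats.items():
    avg_mb = total_bytes / num_files / 1024**2
    if avg_mb < SMALL_FILE_MB:
        print(f"{partition}: avg {avg_mb:.1f} MB/file -> small-file problem")
# → 2026-02-16: avg 2.0 MB/file -> small-file problem
```

Partitions flagged this way are typically candidates for compaction before any repartitioning decision is made.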

When Should You Use This?

This approach is particularly useful when:

  • Using external Delta tables

  • Managing your own storage layout

  • Designing a new partition strategy

  • Migrating legacy Hive-style tables

  • Troubleshooting performance issues

It gives a low-level, transparent view of how data is physically stored.

Final Thoughts

Partitioning decisions should be based on:

  • Query access patterns

  • Partition cardinality

  • Physical partition size

  • File size distribution

Reading the Delta transaction log provides a simple yet powerful way to understand your table layout before committing to a partitioning strategy.
