
Understanding Delta Table Partition Size Distribution Using the Delta Log

Kirankumarbs
Contributor III

When working with externally managed Delta tables and traditional partitioning strategies (for example by day, week, or month), one common challenge is:

How large are my partitions actually?

Before deciding whether to partition by day vs. week vs. month, it's important to understand how data is physically distributed across partitions.

This article shows how to extract partition-level size statistics directly from the Delta transaction log.

โš ๏ธThis approach is useful when using traditional partitioning and external table strategies.
If youโ€™re using Unity Catalog managed tables or Liquid Clustering, partition sizing decisions are handled differently.

Why Look at the Delta Log?

Every Delta table maintains a _delta_log directory containing JSON transaction files.
Each add action inside the log includes:

  • File path

  • File size (in bytes)

  • Partition values

By reading these logs, we can compute:

  • Number of files per partition

  • Total bytes per partition

  • Size distribution across partitions

This gives direct visibility into the physical layout of your table.
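To make that structure concrete, here is a minimal sketch of a single add action parsed with plain Python. The file path, size, and date in this line are invented for illustration; real log lines carry additional fields.

```python
import json

# One illustrative "add" action line, as it would appear in a
# _delta_log JSON commit file (values here are made up).
line = (
    '{"add": {"path": "startDate=2026-02-16/part-00000-abc.snappy.parquet", '
    '"partitionValues": {"startDate": "2026-02-16"}, '
    '"size": 104857600, "dataChange": true}}'
)

add = json.loads(line)["add"]
print(add["partitionValues"]["startDate"], add["size"])
# → 2026-02-16 104857600
```

Each such line contributes one (partition value, file size) pair, which is exactly what the aggregation below groups and sums.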

Example: Compute Partition Sizes by startDate

from pyspark.sql import functions as F

# Each JSON commit file holds one action per line (add, remove,
# metaData, ...); fields for other action types come back as null.
# Note: commits compacted into a Parquet checkpoint may have had their
# JSON files cleaned up, so this reflects the retained part of the log.
delta_log = spark.read.json(
    "abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)

# Files added to the table, with their size and partition value.
adds = (
    delta_log
    .filter("add is not null")
    .select(
        F.col("add.path").alias("path"),
        F.col("add.size").alias("size"),
        F.col("add.partitionValues.startDate").alias("startDate")
    )
    .dropDuplicates(["path"])
)

# Files later removed (OPTIMIZE, overwrites, deletes). Excluding them
# avoids counting data that is no longer live in the table.
removes = (
    delta_log
    .filter("remove is not null")
    .select(F.col("remove.path").alias("path"))
)

files = adds.join(removes, on="path", how="left_anti")

(
    files.groupBy("startDate")
    .agg(
        F.count("*").alias("numFiles"),
        F.sum("size").alias("totalBytes")
    )
    .withColumn("sizeGB", F.col("totalBytes") / (1024**3))
    .orderBy("startDate", ascending=False)
    .show(20, False)
)

Sample Output

+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB  |
+----------+--------+-----------+--------+
|2026-02-16|  xxx   |   xxx     |   xx   |
|2026-02-15|  xxx   |   xxx     |   xx   |
+----------+--------+-----------+--------+

How This Helps With Partition Strategy

Once you have partition size metrics, you can evaluate:

Are partitions too small?

If daily partitions are only a few MB:

  • You may be over-partitioning.

  • Consider partitioning by week or month instead.

Are partitions too large?

If partitions exceed hundreds of GB:

  • Queries may scan too much data.

  • Consider a finer-grained partitioning strategy.

Are files unevenly distributed?

A high numFiles count combined with a small average file size indicates a small-file problem.

Decision Guidelines

| Partition Size (per partition) | Recommendation               |
|--------------------------------|------------------------------|
| < 1 GB                         | Likely over-partitioned      |
| 1–20 GB                        | Usually healthy              |
| 50+ GB                         | Consider finer partitioning  |
| 100+ GB                        | May impact performance       |

(Adjust based on workload and query patterns.)
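These guidelines can be captured in a small helper. The thresholds below simply mirror the table; the 20–50 GB range is a gray zone the table leaves to judgment, and all cutoffs are rules of thumb rather than hard limits.

```python
def partition_recommendation(size_gb: float) -> str:
    # Thresholds mirror the guideline table above; adjust for your
    # own workload and query patterns.
    if size_gb < 1:
        return "likely over-partitioned"
    if size_gb <= 20:
        return "usually healthy"
    if size_gb < 100:
        return "consider finer partitioning"
    return "may impact performance"

print(partition_recommendation(0.2))  # → likely over-partitioned
print(partition_recommendation(8))    # → usually healthy
print(partition_recommendation(250))  # → may impact performance
```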


Bonus: Average File Size Per Partition

You can extend the analysis:

.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))

This helps detect small file problems inside partitions.
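As a sketch of how that column can be used, the loop below flags partitions whose average file size falls under an assumed 32 MB threshold. The stats dictionary and the threshold are hypothetical; in practice the numbers would come from the aggregation above.

```python
# Hypothetical per-partition stats: partition -> (numFiles, totalBytes)
stats = {
    "2026-02-16": (4000, 8 * 1024**3),  # many tiny files
    "2026-02-15": (64, 8 * 1024**3),    # fewer, larger files
}

SMALL_FILE_MB = 32  # assumed threshold; tune for your workload

for partition, (num_files, total_bytes) in stats.items():
    avg_mb = total_bytes / num_files / 1024**2
    if avg_mb < SMALL_FILE_MB:
        print(f"{partition}: avg {avg_mb:.1f} MB/file -> small-file problem")
# → 2026-02-16: avg 2.0 MB/file -> small-file problem
```

Partitions flagged this way are typically candidates for compaction before any repartitioning decision is made.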

When Should You Use This?

This approach is particularly useful when:

  • Using external Delta tables

  • Managing your own storage layout

  • Designing a new partition strategy

  • Migrating legacy Hive-style tables

  • Troubleshooting performance issues

It gives a low-level, transparent view of how data is physically stored.

Final Thoughts

Partitioning decisions should be based on:

  • Query access patterns

  • Partition cardinality

  • Physical partition size

  • File size distribution

Reading the Delta transaction log provides a simple yet powerful way to understand your table layout before committing to a partitioning strategy.
