
Size of each partitioned file (partitioned by default)

Hoping
New Contributor

When I run DESCRIBE DETAIL, I get the number of files the Delta table is partitioned into.

How can I check the size of each of the files that make up my entire table?

Will I be able to query each partitioned file to understand how the data has been split? By default, are the files split on any particular column?

1 REPLY

Kaniz
Community Manager

Hi @Hoping, Certainly! Let’s explore how you can check the size of each partitioned file in a Delta table and understand how they are split:

 

Partitioning in Delta Tables:

  • Delta tables can be partitioned by one or more columns. A date column is the most common choice.
  • When deciding on a partition column, consider the following rules of thumb:
    • Cardinality: If a column has very high cardinality (e.g., millions of distinct values), avoid using it for partitioning.
    • Data Amount: Partition by a column if you expect data in that partition to be at least 1 GB.
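The 1 GB rule of thumb above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes data is spread evenly across partition values, which real tables rarely satisfy exactly.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def average_partition_bytes(table_bytes: int, distinct_values: int) -> float:
    """Average bytes per partition, assuming an even spread across values."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    return table_bytes / distinct_values


def meets_size_rule(table_bytes: int, distinct_values: int,
                    min_partition_bytes: int = GIB) -> bool:
    """True if the average partition would hold at least min_partition_bytes."""
    return average_partition_bytes(table_bytes, distinct_values) >= min_partition_bytes


# Example: a 500 GiB table partitioned by 365 dates vs. 2,000,000 user IDs.
print(meets_size_rule(500 * GIB, 365))        # True  (about 1.37 GiB per partition)
print(meets_size_rule(500 * GIB, 2_000_000))  # False (well under 1 MiB per partition)
```

The table size can be taken from the sizeInBytes column of DESCRIBE DETAIL, and the distinct count from a COUNT(DISTINCT ...) query on the candidate column.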

Checking File Sizes:

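To check the size of each file in a Delta table, you can use the dbutils.fs utility in Databricks to list the table’s directory and read the size of each entry. Note that dbutils.fs.ls is not recursive: for a partitioned table, the data files sit inside the partition subdirectories, so those need to be listed as well. The directory also contains the _delta_log folder and may contain data files that are no longer part of the current table version.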

Here’s an example of how to list the files and their sizes for a Delta table:

    # Specify the path to your Delta table
    delta_table_path = "/path/to/your/delta_table"

    # List the entries in the table's root directory (not recursive)
    file_list = dbutils.fs.ls(delta_table_path)

    # Print file names and sizes
    for file_info in file_list:
        file_name = file_info.name
        file_size = file_info.size
        print(f"File: {file_name}, Size: {file_size} bytes")
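For a partitioned table, a recursive walk may be more useful than a single listing. The following is a minimal sketch, not an official utility: iter_data_files and summarize are hypothetical helper names, and the listing function is passed in as a parameter so the code can also run outside Databricks. It assumes each entry exposes path, name, size and isDir(), as the entries returned by dbutils.fs.ls do, and it skips underscore-prefixed directories such as _delta_log.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def iter_data_files(root: str, ls: Callable[[str], Iterable]) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for every Parquet file under root.

    `ls` lists a single directory; on Databricks, pass dbutils.fs.ls.
    """
    for entry in ls(root):
        if entry.isDir():
            # Skip the transaction log and other underscore-prefixed directories.
            if entry.name.startswith("_"):
                continue
            yield from iter_data_files(entry.path, ls)
        elif entry.name.endswith(".parquet"):
            yield entry.path, entry.size


def summarize(files: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return count, total, min, max and average size for (path, size) pairs."""
    sizes = [size for _, size in files]
    if not sizes:
        return {"count": 0, "total_bytes": 0, "min_bytes": 0,
                "max_bytes": 0, "avg_bytes": 0.0}
    return {"count": len(sizes), "total_bytes": sum(sizes),
            "min_bytes": min(sizes), "max_bytes": max(sizes),
            "avg_bytes": sum(sizes) / len(sizes)}


# On Databricks (dbutils is only defined inside a notebook or job):
# files = list(iter_data_files("/path/to/your/delta_table", dbutils.fs.ls))
# for path, size in files:
#     print(f"{path}: {size} bytes")
# print(summarize(files))
```

Because this reads the directory rather than the transaction log, the result can include files that older table versions still reference. Comparing the count with numFiles from DESCRIBE DETAIL is a quick sanity check.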

Querying Partitioned Files:

When a partition column is specified, the table’s data is organized by the values of that column. Each partition value typically corresponds to a subdirectory within the table’s directory.

You can query a specific partition using the WHERE clause in SQL. For example:

    SELECT * FROM delta_table WHERE date_column = '2023-11-01'

Replace date_column with the actual name of your partition column and '2023-11-01' with the desired partition value.

 

Understanding Default Partitioning:

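  • By default, a Delta table is not partitioned by any column; partitioning happens only if you explicitly choose a partition column.
  • If no partition column is specified, all data files are written directly to the table’s root directory. The table can still consist of many files, but they are split by how the data was written (for example, write parallelism and target file size), not by the values of a column.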
  • When you choose a partition column, Delta automatically organizes the data into subdirectories based on the values of that column.
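To see which partition a given data file belongs to, the partition values can often be read from its path. The sketch below assumes the Hive-style column=value directory naming that partitioned Delta tables commonly use. partition_values is a hypothetical helper, and the file name in the example is made up. The transaction log remains the authoritative source of partition values, so treat the path as a convenience rather than a guarantee.

```python
from typing import Dict
from urllib.parse import unquote


def partition_values(file_path: str) -> Dict[str, str]:
    """Extract {column: value} pairs from column=value directory segments."""
    values: Dict[str, str] = {}
    # Ignore the final segment: it is the file name, not a partition directory.
    for segment in file_path.strip("/").split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            # Special characters in partition paths are percent-encoded.
            values[unquote(name)] = unquote(value)
    return values


print(partition_values(
    "/path/to/your/delta_table/date_column=2023-11-01/part-00000.snappy.parquet"
))
# {'date_column': '2023-11-01'}
```

A file that sits directly in the table’s root directory yields an empty dictionary, which is what you would expect for an unpartitioned table.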

Remember to adapt the code snippet above to your specific Delta table and partition column. If you have any further questions or need additional assistance, feel free to ask! 😊.