cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Delta table partition directories when column mapping is enabled

Gary_Irick
New Contributor III

I recently created a table on a cluster in Azure running Databricks Runtime 11.1. The table is partitioned by a "date" column. I enabled column mapping, like this:

ALTER TABLE {schema}.{table_name} SET TBLPROPERTIES('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')

Before enabling column mapping, the directory containing the Delta table has the expected partition directories: "date=2022-08-18", "date=2022-08-19", etc.

After enabling column mapping, every time I do a MERGE into that table, I get new directories created with short names like "5k", "Rw", "Yd", etc. When I VACUUM the table, most of the directories are empty, but the empty directories are not removed. We merge into this table frequently, so the table containing the Delta table ends up with lots and lots of empty directories.

I have 2 questions:

Is it expected that these directories will be created with names other than the expected "date=2022-08-18"?

Is there a way to make VACUUM remove the empty directories?

I could write code to walk through the Delta table directory and remove the empty directories, but I would rather not touch those directories! That's for Databricks to manage, and I don't want to step in its way.

Thanks in advance for any information you can provide.

9 REPLIES 9

Debayan
Esteemed Contributor III
Esteemed Contributor III

Hi, For removing files or directories using VACUUM , you can refer https://docs.databricks.com/delta/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-ta...

As far as I know, the dates will be the default naming syntax, which can be renamed.

Anonymous
Not applicable

Hi @Gary Irick​ 

Does @Debayan Mukherjee​  response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?

We'd love to hear from you.

Thanks!

gongasxavi
New Contributor II

The same is happening with me. Since enabling column mapping, the new records are stored in folders with random names instead of being stored in its partition folder

Pete_Cotton
New Contributor II

Same issue is happening with me too since enabling column mapping. Files are stored in folders with random 2 character names (0P, 3h, BB) rather than the date value of the load_date partition column (load_date=2023-01-01, load_date=2023-01-02).

Have tried using databricks runtime 12.0 but get the same result when performing an append or merge operation. Has anyone been able to resolve this yet?

aleks1601
New Contributor II

Is there at least an explanation why this is happening and whether it affects performance?

nan
New Contributor II

seen the same behavior. waiting for some explanation.

Tharun-Kumar
Honored Contributor II
Honored Contributor II

@Gary_Irick @Pete_Cotton 
This is expected. Enabling column mapping enables random file prefixes, which removes the ability to explore data using Hive-style partitioning.

This is also documented here - https://docs.databricks.com/delta/delta-column-mapping.html#:~:text=Enabling%20column%20mapping%20al....

 

vascoa
New Contributor II

Same is happening to me and very frustrating as it irreversibly breaks our process.

Kaniz
Community Manager
Community Manager

Hi @Gary_Irick, @gongasxavi , @Pete_Cotton , @aleks1601

 

Certainly, let’s address your questions regarding Delta table partition directories and column mapping.

 

Directory Names with Column Mapping:

Removing Empty Directories with VACUUM:

  • The VACUUM command in Delta Lake is used to clean up old versions of files and directories.
  • By default, VACUUM removes files that are no longer needed for query consistency.
  • However, it does not automatically remove empty directories created during schema evolution or other operations.
  • To address this, you can use the RETAIN option with VACUUM to specify the minimum number of files to retain in each partition directory.
  • For example :  VACUUM tablename RETAIN 1 HOURS;
  • Adjust the retention period as needed to keep the desired number of files in the partition directori....

Custom Cleanup Logic:

  • If you prefer not to touch the directories manually, consider writing custom code to periodically clean up empty directories.
  • You can schedule a job to traverse the Delta table directory and remove empty directories based on your business logic.
  • Be cautious and ensure that your cleanup logic aligns with Databricks’ management practices.

Remember that while column mapping provides flexibility, it can lead to unconventional directory names. Use the VACUUM command wisely to manage your Delta table storage efficiently. 🚀

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.