
Delta table partition directories when column mapping is enabled

Gary_Irick
New Contributor III

I recently created a table on a cluster in Azure running Databricks Runtime 11.1. The table is partitioned by a "date" column. I enabled column mapping, like this:

ALTER TABLE {schema}.{table_name} SET TBLPROPERTIES('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')

Before enabling column mapping, the directory containing the Delta table has the expected partition directories: "date=2022-08-18", "date=2022-08-19", etc.

After enabling column mapping, every time I do a MERGE into that table, I get new directories created with short names like "5k", "Rw", "Yd", etc. When I VACUUM the table, most of the directories are empty, but the empty directories are not removed. We merge into this table frequently, so the directory containing the Delta table ends up with lots and lots of empty directories.

I have 2 questions:

Is it expected that these directories will be created with names other than the expected "date=2022-08-18"?

Is there a way to make VACUUM remove the empty directories?

I could write code to walk through the Delta table directory and remove the empty directories, but I would rather not touch those directories! That's for Databricks to manage, and I don't want to step in its way.
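
For illustration only, such cleanup code might look roughly like the sketch below. It assumes a Databricks notebook where `dbutils` is available, uses a placeholder table path, and skips the _delta_log directory; the helper name is made up for the example.

# Sketch only: remove empty data directories under a Delta table path.
# Run at your own risk -- these directories are managed by Databricks.
table_path = "abfss://container@account.dfs.core.windows.net/path/to/table"  # placeholder

def remove_empty_dirs(path):
    # Recurse into subdirectories, leaving the transaction log alone.
    for entry in dbutils.fs.ls(path):
        if entry.isDir() and not entry.path.rstrip("/").endswith("_delta_log"):
            remove_empty_dirs(entry.path)
            if not dbutils.fs.ls(entry.path):  # directory is now empty
                dbutils.fs.rm(entry.path, False)

remove_empty_dirs(table_path)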

Thanks in advance for any information you can provide.

9 REPLIES

Debayan
Databricks Employee

Hi, for removing files or directories using VACUUM, you can refer to https://docs.databricks.com/delta/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-ta...
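
For reference, the basic call from a notebook looks like this (schema and table names are placeholders, 168 hours is the default 7-day retention window, and `spark` is the ambient SparkSession):

spark.sql("VACUUM my_schema.my_table RETAIN 168 HOURS")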

As far as I know, the date-based names are the default naming syntax, and the directories can be renamed.

Anonymous
Not applicable

Hi @Gary Irick,

Does @Debayan Mukherjee's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?

We'd love to hear from you.

Thanks!

gongasxavi
New Contributor II

The same thing is happening to me. Since enabling column mapping, new records are stored in folders with random names instead of in their partition folders.

Pete_Cotton
New Contributor II

The same issue is happening for me too since enabling column mapping. Files are stored in folders with random two-character names (0P, 3h, BB) rather than under the date value of the load_date partition column (load_date=2023-01-01, load_date=2023-01-02).

I have tried Databricks Runtime 12.0 but get the same result when performing an append or merge operation. Has anyone been able to resolve this yet?

aleks1601
New Contributor II

Is there at least an explanation of why this is happening and whether it affects performance?

nan
New Contributor II

I've seen the same behavior and am waiting for an explanation.

Tharun-Kumar
Databricks Employee

@Gary_Irick @Pete_Cotton 
This is expected. Enabling column mapping enables random file prefixes, which removes the ability to explore data using Hive-style partitioning.

This is also documented here - https://docs.databricks.com/delta/delta-column-mapping.html#:~:text=Enabling%20column%20mapping%20al....
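
As a quick check, a sketch along these lines (table name is a placeholder; `spark` is the ambient SparkSession in a notebook) shows whether column mapping is enabled on a table, which is what triggers the randomized prefixes:

# Print the Delta column-mapping properties of a table (placeholder name).
for row in spark.sql("SHOW TBLPROPERTIES my_schema.my_table").collect():
    if row["key"].startswith("delta.columnMapping"):
        print(row["key"], "=", row["value"])
# A value of 'name' or 'id' means new data files are written under random
# short prefixes instead of Hive-style directories like date=2022-08-18.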


vascoa
New Contributor II

The same is happening to me, and it is very frustrating, as it irreversibly breaks our process.

talenik
New Contributor III

Hi @Retired_mod,

I have a few questions about directory names with column mapping. I have a Delta table on ADLS, and when I try to read it, I get the error below. How can we read Delta tables with column mapping enabled using PySpark?

Can you please help?

A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: {{delta table name}}

Edit:

I was able to read the tables as-is. Maybe it was an issue with the Delta version.

Regards,

Nikhil
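
For anyone hitting the same read error, here is a minimal sketch of reading such a table by path with PySpark (the ADLS path is a placeholder; it assumes a Delta Lake / Runtime version recent enough to support reader protocol version 2, which column mapping requires):

# Read a column-mapped Delta table by path; `spark` is the ambient SparkSession.
df = spark.read.format("delta").load(
    "abfss://container@account.dfs.core.windows.net/path/to/table"  # placeholder
)
df.show()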
