09-13-2022 11:20 AM
I recently created a table on a cluster in Azure running Databricks Runtime 11.1. The table is partitioned by a "date" column. I enabled column mapping, like this:
ALTER TABLE {schema}.{table_name} SET TBLPROPERTIES('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')
Before enabling column mapping, the directory containing the Delta table has the expected partition directories: "date=2022-08-18", "date=2022-08-19", etc.
After enabling column mapping, every time I do a MERGE into that table, new directories are created with short random names like "5k", "Rw", "Yd", etc. When I VACUUM the table, most of the directories end up empty, but the empty directories are not removed. We merge into this table frequently, so the directory containing the Delta table accumulates a large number of empty directories.
I have 2 questions:
Is it expected that these directories will be created with names other than the expected "date=2022-08-18"?
Is there a way to make VACUUM remove the empty directories?
I could write code to walk through the Delta table directory and remove the empty directories, but I would rather not touch those directories. That is for Databricks to manage, and I don't want to get in its way.
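For what it's worth, the cleanup walk described above can be sketched in plain Python. To be clear, this is a hypothetical illustration, not an official or recommended Databricks workaround: the function name, the bottom-up walk, and the decision to skip `_delta_log` are all my own assumptions, and modifying a Delta table's directory tree by hand carries the risks the post mentions.

```python
# Hypothetical sketch: remove empty subdirectories under a Delta table
# root, bottom-up, never touching the _delta_log directory. This is an
# illustration only, not an official Databricks utility.
import os

def remove_empty_dirs(table_root: str) -> list[str]:
    """Remove empty subdirectories under table_root, deepest first.

    Returns the list of directories removed. The transaction log
    directory (_delta_log) and the table root itself are never removed.
    """
    removed = []
    # topdown=False visits children before parents, so a directory that
    # contained only empty directories becomes removable in the same pass.
    for dirpath, dirnames, filenames in os.walk(table_root, topdown=False):
        if dirpath == table_root:
            continue
        if "_delta_log" in dirpath.split(os.sep):
            continue
        if not os.listdir(dirpath):  # still empty after children handled
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Anyone using something like this would want to run it only while no writes are in flight, since a concurrent MERGE could be creating files in a directory that looks empty.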
Thanks in advance for any information you can provide.
09-15-2022 09:55 PM
Hi, for removing files or directories using VACUUM, you can refer to https://docs.databricks.com/delta/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-ta...
As far as I know, the date-based partition directory names are the default naming syntax, and they can be renamed.
09-27-2022 05:11 AM
Hi @Gary Irick
Does @Debayan Mukherjee's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?
We'd love to hear from you.
Thanks!
12-16-2022 09:26 AM
The same is happening to me. Since enabling column mapping, new records are stored in folders with random names instead of in their partition folders.
01-03-2023 04:00 AM
Same issue is happening with me too since enabling column mapping. Files are stored in folders with random 2 character names (0P, 3h, BB) rather than the date value of the load_date partition column (load_date=2023-01-01, load_date=2023-01-02).
I have tried Databricks Runtime 12.0 but get the same result when performing an append or merge operation. Has anyone been able to resolve this yet?
04-04-2023 04:51 AM
Is there at least an explanation why this is happening and whether it affects performance?
07-12-2023 12:37 PM
I've seen the same behavior and am waiting for an explanation.
07-12-2023 01:52 PM
@Gary_Irick @Pete_Cotton
This is expected. Enabling column mapping enables random file prefixes, which removes the ability to explore data using Hive-style partitioning.
This is also documented here - https://docs.databricks.com/delta/delta-column-mapping.html#:~:text=Enabling%20column%20mapping%20al....
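To confirm whether a given table has this behavior enabled, you could check its `delta.columnMapping.mode` table property (for example via `SHOW TBLPROPERTIES` in Spark SQL). As a minimal standalone sketch, assuming direct filesystem access to the table's `_delta_log` directory, the property can also be read from the newest `metaData` action in the transaction log; the function name and local-path assumption here are mine, though the JSON layout of the log is part of the Delta protocol:

```python
# Sketch: report a Delta table's column mapping mode by scanning its
# transaction log for the most recent metaData action. Assumes the
# _delta_log directory is readable on a local or mounted filesystem.
import json
import os

def column_mapping_mode(table_root: str) -> str:
    """Return delta.columnMapping.mode from the newest metaData action,
    or 'none' if the property was never set."""
    log_dir = os.path.join(table_root, "_delta_log")
    mode = "none"
    # Commit files are zero-padded, so lexicographic order is commit order.
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "metaData" in action:
                    conf = action["metaData"].get("configuration", {})
                    mode = conf.get("delta.columnMapping.mode", "none")
    return mode
```

A mode of `'name'` (or `'id'`) indicates column mapping is active, which is when the short random directory prefixes appear in place of Hive-style `date=...` paths.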
11-20-2023 08:48 AM
The same is happening to me, and it's very frustrating, as it irreversibly breaks our process.
07-30-2024 01:22 AM - edited 07-30-2024 03:13 AM
Hi @Retired_mod ,
I have a few queries about directory names with column mapping. I have a Delta table on ADLS, and when I try to read it I get the error below. How can we read Delta tables with column mapping enabled using PySpark?
Can you please help?
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: {{delta table name}}
Edit:
I was able to read the tables as is. Maybe it was an issue with the Delta version.
Regards,
Nikhil