09-05-2024 01:16 AM
I have delta tables with multiple partition columns. I want to remove most of the partition columns and retain just one. I can see there are ALTER TABLE...PARTITION options but these are not supported for delta lake tables. So is there a way to do this - or do I need to recreate the tables.
09-05-2024 01:20 AM - edited 09-05-2024 01:21 AM
Hi @AndyG ,
Maybe try the way official delta guid is suggesting:
Adding and Deleting Partitions in Delta Lake tables | Delta Lake
You can delete all rows from a given partition to remove the partition from the Delta table.
Here’s how to delete all the rows with individuals from Argentina.
dt = delta.DeltaTable.forName(spark, "country_people") dt.delete(F.col("country") == "Argentina")
Let’s run the vacuum twice and observe how the Argentina partition is deleted from the filesystem.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false") spark.sql("VACUUM country_people RETAIN 0 HOURS") spark.sql("VACUUM country_people RETAIN 0 HOURS")
NOTE: We’re only setting the retention period to 0 hours in this example to demonstrate disk structure changes. The retention period should normally be at least 7 days. A retention period of 0 hours is dangerous because it can break concurrent write operations and time travel.
View the contents of the filesystem and make sure that the Argentina partition was removed.
spark-warehouse/country_people ├── _delta_log │ ├── 00000000000000000000.json │ ├── 00000000000000000001.json │ └── 00000000000000000002.json ├── country=China │ └── part-00000-9a8d67fa-c23d-41a4-b570-a45405f9ad78.c000.snappy.parquet ├── country=Colombia │ └── part-00000-7e3d3d49-39e9-4eb2-ab92-22a485291f91.c000.snappy.parquet └── country=Russia └── part-00000-c49ca623-ea69-4088-8d85-c7c2de30cc28.c000.snappy.parquet
You need to run vacuum twice to completely remove the Argentina partition. The first vacuum run deletes the files with Argentina data, and the Argentina directory becomes empty. The second vacuum run deletes the empty Argentina directory. You don’t normally have to run vacuum twice for all changes to take effect, but this is a special edge case. See this blog post to learn more about the vacuum command.
09-05-2024 01:27 AM
I'm not looking to delete individual partitions, but change the way the tables are actually partitioned. They use multiple columns as partitions - I want to remove most of the columns and retain just one column. So the tables are partitioned using one column only.
09-05-2024 01:41 AM
Hi Slash,
I`ve seen mention that one way to do it (and a recommended by databricks way) is to use REPLACE TABLE. I`ve personally tried that in the past but it drops any auto incremement columns on the replaced table which is a problem for us.
09-05-2024 01:43 AM
Hi @AndyG , @AndySkinner
Yeah, I misunderstood the question. I would do this in following way:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (column_name)
AS
SELECT * FROM <tablename>
09-05-2024 01:55 AM
Hey @szymon_dybczak ,
This removes the auto increment column on the new table that gets created, which is a big problem
Andy
09-05-2024 02:11 AM - edited 09-05-2024 02:34 AM
Yep, but I don't think there is a way do it without messing up with auto increment. But maybe someone share some idea...
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group