Databricks Community

AndyG · ‎09-05-2024

I have Delta tables with multiple partition columns. I want to remove most of the partition columns and keep just one. I can see there are ALTER TABLE ... PARTITION options, but these are not supported for Delta Lake tables. Is there a way to do this, or do I need to recreate the tables?

szymon_dybczak · ‎09-05-2024

Hi @AndyG ,

Maybe try the approach the official Delta Lake guide suggests:

Adding and Deleting Partitions in Delta Lake tables | Delta Lake

You can delete all rows from a given partition to remove the partition from the Delta table.

Here’s how to delete all the rows with individuals from Argentina.

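import delta
from pyspark.sql import functions as F

dt = delta.DeltaTable.forName(spark, "country_people")
# Delete every row whose partition value is Argentina
dt.delete(F.col("country") == "Argentina")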

Let’s run the vacuum twice and observe how the Argentina partition is deleted from the filesystem.

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql("VACUUM country_people RETAIN 0 HOURS")
spark.sql("VACUUM country_people RETAIN 0 HOURS")

NOTE: We’re only setting the retention period to 0 hours in this example to demonstrate disk structure changes. The retention period should normally be at least 7 days. A retention period of 0 hours is dangerous because it can break concurrent write operations and time travel.
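
A minimal sketch of the guard this note implies, assuming a hypothetical helper named build_vacuum_sql (it is not part of Delta Lake). It only builds the VACUUM statement and refuses a retention period shorter than 7 days unless the caller explicitly opts in:

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # the 7-day minimum mentioned in the note


def build_vacuum_sql(table_name: str,
                     retain_hours: int = DEFAULT_RETENTION_HOURS,
                     allow_unsafe: bool = False) -> str:
    """Return a VACUUM statement, rejecting short retention unless opted in.

    Hypothetical helper for illustration; it only builds the SQL string.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    if retain_hours < DEFAULT_RETENTION_HOURS and not allow_unsafe:
        raise ValueError(
            f"retention of {retain_hours}h is below {DEFAULT_RETENTION_HOURS}h; "
            "pass allow_unsafe=True only for demos"
        )
    return f"VACUUM {table_name} RETAIN {retain_hours} HOURS"


# Usage on a cluster, where `spark` is an existing SparkSession:
# spark.sql(build_vacuum_sql("country_people"))
```

The table name is interpolated as-is, so this assumes it comes from trusted code rather than user input.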

View the contents of the filesystem and make sure that the Argentina partition was removed.

spark-warehouse/country_people
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── 00000000000000000002.json
├── country=China
│   └── part-00000-9a8d67fa-c23d-41a4-b570-a45405f9ad78.c000.snappy.parquet
├── country=Colombia
│   └── part-00000-7e3d3d49-39e9-4eb2-ab92-22a485291f91.c000.snappy.parquet
└── country=Russia
    └── part-00000-c49ca623-ea69-4088-8d85-c7c2de30cc28.c000.snappy.parquet

You need to run vacuum twice to completely remove the Argentina partition. The first vacuum run deletes the files with Argentina data, and the Argentina directory becomes empty. The second vacuum run deletes the empty Argentina directory. You don’t normally have to run vacuum twice for all changes to take effect, but this is a special edge case. See this blog post to learn more about the vacuum command.
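
The filesystem check described here can be sketched in plain Python. This assumes the table lives on a local filesystem, as in the spark-warehouse example; partition_values is a hypothetical helper name, and it would not work as written against cloud object storage:

```python
from pathlib import Path


def partition_values(table_path: str, partition_column: str) -> list[str]:
    """Return the sorted partition values that still have a directory on disk."""
    prefix = f"{partition_column}="
    return sorted(
        entry.name[len(prefix):]
        for entry in Path(table_path).iterdir()
        # Skip _delta_log and any plain files; keep only "<column>=<value>" dirs
        if entry.is_dir() and entry.name.startswith(prefix)
    )


# Usage against the example table directory:
# remaining = partition_values("spark-warehouse/country_people", "country")
# assert "Argentina" not in remaining
```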

AndyG · ‎09-05-2024

I'm not looking to delete individual partitions, but to change how the tables are partitioned. They currently use multiple partition columns; I want to remove most of them and keep just one, so that each table is partitioned by a single column.

AndySkinner · ‎09-05-2024

Hi Slash,

I've seen it mentioned that one way to do this (and the way Databricks recommends) is to use REPLACE TABLE. I've tried that in the past, but it drops any auto-increment columns on the replaced table, which is a problem for us.

szymon_dybczak · ‎09-05-2024

Hi @AndyG , @AndySkinner

Yeah, I misunderstood the question. I would do it the following way:

REPLACE TABLE <tablename>
  USING DELTA
  PARTITIONED BY (column_name)
AS
  SELECT * FROM <tablename>

Best practices: Delta Lake | Databricks on AWS
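
The same statement can be driven from Python. This is a sketch under assumptions: build_repartition_sql and repartition are hypothetical helper names, `spark` is an existing SparkSession, and the identifiers are trusted because they are interpolated directly. After the rewrite, it reads the partitionColumns field from DESCRIBE DETAIL to confirm the new layout:

```python
def build_repartition_sql(table_name: str, partition_column: str) -> str:
    """Build the REPLACE TABLE statement that rewrites a table with one partition column."""
    return (
        f"REPLACE TABLE {table_name}\n"
        f"  USING DELTA\n"
        f"  PARTITIONED BY ({partition_column})\n"
        f"AS\n"
        f"  SELECT * FROM {table_name}"
    )


def repartition(spark, table_name: str, partition_column: str) -> list:
    """Run the rewrite, then return the partition columns Delta reports."""
    spark.sql(build_repartition_sql(table_name, partition_column))
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
    # DESCRIBE DETAIL returns one row; partitionColumns is an array of names
    return list(detail.select("partitionColumns").first()[0])


# Usage on a cluster:
# print(repartition(spark, "country_people", "country"))
```

As noted elsewhere in this thread, a rewrite like this does not carry over auto-increment columns.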

AndySkinner · ‎09-05-2024

Hey @szymon_dybczak ,

This removes the auto-increment column from the newly created table, which is a big problem.

Andy

szymon_dybczak · ‎09-05-2024

Yep, but I don't think there is a way to do it without breaking the auto-increment column. Maybe someone else can share an idea...
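
One untested idea, offered as a sketch rather than a confirmed fix: create a new table with an explicit schema whose identity column is declared GENERATED BY DEFAULT AS IDENTITY, copy the rows (including their existing IDs), run ALTER TABLE ... ALTER COLUMN ... SYNC IDENTITY so new IDs continue after the copied ones, and then swap the table names. The helper name, the staging and backup table names, and the column map below are all hypothetical, and the statements have not been verified against any particular Databricks Runtime version:

```python
def build_identity_preserving_rewrite(table_name: str, columns: dict,
                                      identity_column: str,
                                      partition_column: str) -> list:
    """Build SQL statements that rebuild a table with a single partition column
    while keeping existing identity values.

    `columns` maps column name -> SQL type, in table order. Untested sketch.
    """
    new_name = f"{table_name}__repartitioned"  # hypothetical staging name
    backup_name = f"{table_name}__backup"      # hypothetical backup name
    defs = []
    for name, sql_type in columns.items():
        if name == identity_column:
            # BY DEFAULT (not ALWAYS) so the copied IDs can be inserted explicitly
            defs.append(f"{name} {sql_type} GENERATED BY DEFAULT AS IDENTITY")
        else:
            defs.append(f"{name} {sql_type}")
    col_list = ", ".join(columns)
    return [
        f"CREATE TABLE {new_name} ({', '.join(defs)}) "
        f"USING DELTA PARTITIONED BY ({partition_column})",
        f"INSERT INTO {new_name} ({col_list}) SELECT {col_list} FROM {table_name}",
        f"ALTER TABLE {new_name} ALTER COLUMN {identity_column} SYNC IDENTITY",
        f"ALTER TABLE {table_name} RENAME TO {backup_name}",
        f"ALTER TABLE {new_name} RENAME TO {table_name}",
    ]


# Usage on a cluster, where `spark` is an existing SparkSession:
# for stmt in build_identity_preserving_rewrite(
#         "people", {"id": "BIGINT", "name": "STRING", "country": "STRING"},
#         identity_column="id", partition_column="country"):
#     spark.sql(stmt)
```

Note that if the original column was GENERATED ALWAYS, this changes it to GENERATED BY DEFAULT, which allows explicit inserts into that column from then on. Keeping the backup table until the result is checked seems prudent.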

Remove partition column from delta table
