Merge version data files of Delta table

Data_Analytics1
Contributor III

Hi,

I have a CDC-enabled Delta table. At version 256, the table has 50 data files. I want to merge them all into a single file, so that when I query version 256 I read just one data file. Is there a command that can optimize the file sizes?

3 REPLIES

Kaniz
Community Manager

Hi @Data_Analytics1 , 

Yes, you can use Delta Lake's OPTIMIZE command to compact a Delta table's small files into larger ones, which can improve query performance and reduce storage overhead.

Here is an example of how to optimize your Delta table:

from delta.tables import DeltaTable

# Load the Delta table by its storage path
deltaTable = DeltaTable.forPath(spark, "/path/to/your/delta/table")

# Compact the table's small files into larger ones
deltaTable.optimize().executeCompaction()

The OPTIMIZE command compacts the table's small files into larger ones. After it completes, the current snapshot of the Delta table references fewer data files, and queries against it should run faster.

Note that the optimize command is a relatively expensive operation and should only be used when necessary, such as to address performance issues with the table.

Also, keep in mind that running OPTIMIZE creates a new version of the Delta table; the original small files remain on storage, still referenced by earlier versions for time travel, until they are removed with VACUUM. This means the optimized Delta table can still occupy unused space in the file system.
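If that leftover space becomes a concern, the old files can be cleaned up with VACUUM. A minimal sketch, assuming the three-part table name from this thread and the default 7-day (168-hour) retention period:

```sql
-- Remove data files no longer referenced by any table version
-- within the retention window (default 168 hours = 7 days)
VACUUM catalogName.schemaName.tableName RETAIN 168 HOURS;
```

Be aware that once VACUUM deletes those files, time travel and Change Data Feed reads of older versions that depend on them will fail, so do not run it if you still need to query version 256.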

You can optionally supply the ZORDER BY clause to the OPTIMIZE command if you want to organize the data in the optimized file. This can improve query performance when you use filters or group data based on specific columns.
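In SQL, compaction and Z-ordering can be combined in one statement. A sketch, again assuming the table name from this thread; the column name eventDate is a placeholder for whatever column your queries filter on:

```sql
-- Compact small files and cluster the data by a frequently filtered column
OPTIMIZE catalogName.schemaName.tableName
ZORDER BY (eventDate);
```

Z-ordering co-locates related values in the same files, so filters on that column can skip more files.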

Hi @Kaniz ,

Here the 50 data files at version 256 belong to the Change Data Feed (CDF), which I am querying with the following code:

display(spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 256) \
  .option("endingVersion", 256) \
  .table("catalogName.schemaName.tableName"))

The current version of the table is 300, but I want to see the updates made in version 256. What I found is that this version has 50 data files, and reading all of them makes my query slow to return results. How can I optimize version 256 of the table when the current version is 300?

The OPTIMIZE command only optimizes the current snapshot of the table, creating a new version with fewer files; it does not rewrite the data files of historical versions.

Debayan
Esteemed Contributor III