Deleting external table takes 8 hrs

Ankita1
New Contributor

Hi,

I am trying to delete the data from an external partitioned table. It holds around 3 years of data, and the partition is created on a date column.

I am deleting each partition first and then dropping the table's schema, which takes around 8 hours. Is there any way to reduce this time?

Thanks.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Ankita1,

Deleting a large amount of data from an external partitioned table can take significant time, but there are several ways to optimize the operation and reduce that time.

Here are some tips to help you optimize the delete operation:

  1. Use Delta Lake: Consider converting the external partitioned table to a Delta table. Delta Lake provides optimized read and write performance; a DELETE on a Delta table rewrites only the affected files, and the VACUUM command then cleans up the obsolete data files. Delta Lake also supports partition pruning, which can speed up queries and deletes that filter on the partition columns. See the example below.
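
For example, a Parquet-backed external table can be converted in place; the path and partition column below are illustrative:

%sql
CONVERT TO DELTA parquet.`/mnt/data/my_table` PARTITIONED BY (date_col DATE)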

  2. Use a WHERE clause to filter partitions: If you only need to delete data in a specific date range, use a WHERE clause in the DELETE statement to restrict the operation to the matching partitions. For example:

%sql
DELETE FROM my_table
WHERE date_col >= '2022-01-01' AND date_col < '2022-02-01'

This statement deletes data from my_table for January 2022.

  3. Use parallelism: If your cluster has enough resources, consider increasing the number of concurrent tasks to increase the parallelism of the delete operation. You can do this by raising the spark.sql.shuffle.partitions configuration parameter before running the delete statement; for example, set it to match the number of cores on your cluster to achieve maximum parallelism, as shown below.
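
For example (the value 8 is illustrative; choose a value suited to your cluster):

%sql
SET spark.sql.shuffle.partitions = 8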

This statement sets the number of shuffle partitions to 8 before running the delete statement.

  4. Optimize storage layout: If your table is not partitioned, consider partitioning it by the date column to speed up deletes. Partitioning by the date column lets you drop whole partitions at once instead of deleting individual records, as in the example below. You can also consider Z-Ordering or clustering to group data with similar keys together, further improving query performance.
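
For instance, on a partitioned external table, a whole partition can be dropped as a single metadata operation; the table and column names here are illustrative, and note that dropping a partition of an external table leaves the underlying data files in place:

%sql
ALTER TABLE my_table DROP IF EXISTS PARTITION (date_col = '2022-01-01')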

  5. Optimize storage format: Consider using a more efficient storage format such as Parquet or Delta Lake to reduce the I/O needed to read and write the data. Also consider compressing the data to reduce the storage footprint and the I/O performed during the delete operation.
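
As a sketch, the Parquet compression codec can be set session-wide before writing; snappy is shown here, and zstd or gzip are alternatives:

%sql
SET spark.sql.parquet.compression.codec = snappy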

By applying these tips, you can optimize the delete operation on a large external partitioned table and reduce its execution time. However, test your approach thoroughly to make sure you do not delete more data than intended and that the data stays consistent.
