09-01-2025 04:41 AM - edited 09-01-2025 04:48 AM
Hi @Sainath368 ,
I wouldn't use Photon for this kind of task. You should use it primarily for ETL transformations, where it shines.
VACUUM and OPTIMIZE are maintenance tasks, and using Photon for them would be pricey overkill.
According to the documentation, enabling Photon is recommended for workloads with the following characteristics:
- ETL pipelines consisting of Delta MERGE operations
- Writing large volumes of data to cloud storage (Delta/Parquet)
- Scans of large data sets, joins, aggregations and decimal computations
- Auto Loader to incrementally and efficiently process new data arriving in storage
- Interactive/ad hoc queries using SQL
The documentation lists these advantages of Photon:
- Accelerated queries that process a significant amount of data (> 100GB) and include aggregations and joins
- Faster performance when data is accessed repeatedly from the Delta cache
- More robust scan/read performance on tables with many columns and many small files
- Faster Delta writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT
- Join improvements
Comprehensive Guide to Optimize Data Workloads | Databricks
For instance, for VACUUM Databricks recommends using compute-optimized instances. Since OPTIMIZE is also compute intensive, I'd guess the same applies to it.
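To make that concrete, here is a rough sketch of what a dedicated maintenance job cluster without Photon could look like. It only builds the cluster spec and the SQL statements as plain Python values, so it runs anywhere. The `runtime_engine` field follows the Clusters API (`STANDARD` vs `PHOTON`); the runtime version, the `c5.2xlarge` node type (an AWS compute-optimized instance), the worker count, and the table name `main.sales.orders` are all placeholders I made up, so swap in your own.

```python
import json


def maintenance_cluster_spec(node_type_id: str, num_workers: int = 2) -> dict:
    """Build a job-cluster spec for maintenance work with Photon turned off."""
    return {
        # Assumption: placeholder runtime version, use whichever LTS you run.
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # "STANDARD" keeps Photon disabled; "PHOTON" would enable it.
        "runtime_engine": "STANDARD",
    }


def maintenance_statements(table: str, retain_hours: int = 168) -> list[str]:
    """Return the OPTIMIZE and VACUUM statements for one Delta table."""
    return [
        f"OPTIMIZE {table}",
        # 168 hours matches the default 7-day retention period.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


# Assumption: hypothetical node type and table name, for illustration only.
spec = maintenance_cluster_spec("c5.2xlarge")
statements = maintenance_statements("main.sales.orders")

print(json.dumps(spec, indent=2))
for statement in statements:
    print(statement)
```

You would pass the spec as the `new_cluster` of a scheduled job and run the statements with `spark.sql(...)` in the job's notebook or script. Keeping maintenance on its own cluster means your ETL jobs can still use Photon where it pays off.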