Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is Photon Acceleration Helpful for All Maintenance Tasks (OPTIMIZE, VACUUM, ANALYZE_COMPUTE_STATS)?

Sainath368
New Contributor III

Hi everyone,

We're currently reviewing the performance impact of enabling Photon acceleration on our Databricks jobs, particularly those involving table maintenance tasks. Our job includes three main operations: OPTIMIZE, VACUUM, and ANALYZE_COMPUTE_STATS. We've observed that enabling Photon significantly improves the performance of the ANALYZE_COMPUTE_STATS task: it runs much faster when Photon is enabled on the cluster.

Given that, I'm wondering if enabling Photon for the other two tasks (OPTIMIZE and VACUUM) would also lead to better performance or reduced job time. Has anyone experienced improvements in these tasks with Photon?

Also, more generally, I'd like to understand which types of tasks or workloads benefit most from Photon acceleration.

Any insights, benchmarks, or shared experiences would be really helpful. Thanks!

 

1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @Sainath368 ,

I wouldn't use Photon for this kind of task. You should use it primarily for ETL transformations, where it shines.
VACUUM and OPTIMIZE are maintenance tasks, and using Photon for them would be pricey overkill.
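
If you want to check this on your own tables, here is a minimal timing sketch rather than a proper benchmark. It assumes the three tasks map to the SQL statements `OPTIMIZE`, `VACUUM`, and `ANALYZE TABLE ... COMPUTE STATISTICS`; the helper `time_maintenance` and the table name in the usage note are hypothetical. The `run_sql` argument is whatever executes a statement in your environment, for example `lambda q: spark.sql(q).collect()` in a notebook.

```python
import time
from typing import Callable, Dict


def time_maintenance(run_sql: Callable[[str], object], table: str) -> Dict[str, float]:
    """Run each maintenance statement once and return wall-clock seconds per task.

    `run_sql` is supplied by the caller (an assumption of this sketch), so the
    function itself has no Spark dependency.
    """
    statements = {
        "OPTIMIZE": f"OPTIMIZE {table}",
        "VACUUM": f"VACUUM {table}",
        "ANALYZE": f"ANALYZE TABLE {table} COMPUTE STATISTICS",
    }
    timings: Dict[str, float] = {}
    for name, sql in statements.items():
        start = time.perf_counter()
        run_sql(sql)  # blocks until the statement finishes
        timings[name] = time.perf_counter() - start
    return timings
```

Run it once on a Photon cluster and once on a standard cluster, e.g. `time_maintenance(lambda q: spark.sql(q).collect(), "my_catalog.my_schema.my_table")`. Keep in mind that a second OPTIMIZE or VACUUM on the same table has far less work to do, so the two runs need tables in a comparable state, and the DBU cost per hour should be weighed against any time saved.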

According to the documentation, Photon is recommended for workloads with the following characteristics:

  • ETL pipelines consisting of Delta MERGE operations
  • Writing large volumes of data to cloud storage (Delta/Parquet)
  • Scans of large data sets, joins, aggregations and decimal computations
  • Auto Loader to incrementally and efficiently process new data arriving in storage
  • Interactive/ad hoc queries using SQL

Regarding advantages of Photon:

  • Accelerated queries that process a significant amount of data (> 100GB) and include aggregations and joins
  • Faster performance when data is accessed repeatedly from the Delta cache
  • More robust scan/read performance on tables with many columns and many small files
  • Faster Delta writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT
  • Join improvements

Comprehensive Guide to Optimize Data Workloads | Databricks

 


For instance, for VACUUM, Databricks recommends using compute-optimized instances. Since OPTIMIZE is also compute-intensive, I'd guess the same recommendation applies to it.
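
As a rough illustration of what that could look like, here is a sketch of a job cluster definition with Photon turned off. The field names follow the Databricks Clusters API (`runtime_engine` accepts `"STANDARD"` or `"PHOTON"`); the helper name, the node type `c5.2xlarge` (an AWS compute-optimized instance), the runtime version string, and the autoscale range are assumptions chosen for the example, so adjust them to your cloud and workspace.

```python
import json


def maintenance_cluster_spec(node_type_id: str, spark_version: str, max_workers: int = 4) -> dict:
    """Build a `new_cluster` payload for a maintenance job with Photon disabled."""
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,   # pick a compute-optimized instance type
        "runtime_engine": "STANDARD",   # "PHOTON" would enable Photon instead
        "autoscale": {"min_workers": 1, "max_workers": max_workers},
    }


# Example values are illustrative assumptions, not a recommendation from this thread.
spec = maintenance_cluster_spec("c5.2xlarge", "15.4.x-scala2.12")
print(json.dumps(spec, indent=2))
```

Since ANALYZE_COMPUTE_STATS did benefit from Photon in your tests, one option is to split the job so that task runs on a Photon cluster while OPTIMIZE and VACUUM run on a cluster like the one above.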


 
