
Job running on Ataccama Profiler takes long to complete or crashes

anardinelli
New Contributor III

If you're running the Ataccama profiler on tables that involve multiple joins or are very large, note that some of the profiler's processes can perform poorly.

If your jobs crash or run for several hours, check the following:

1. Running aggregation profiles can group data into single partitions, which can shuffle a lot of data across the worker nodes. Operations such as groupByKey and sortByKey are costly and are not optimized in the Ataccama tool. If the Stages tab of the Spark UI shows that too much data is being shuffled, increase the worker memory size.

2. Run OPTIMIZE on your Delta tables that are being profiled.

3. If the profiling process involves multiple joins, join the tables first, outside the profiling data flow. Then run OPTIMIZE on the resulting Delta table and run the profile on that joined table.

4. Check the cluster's "Event Log" page for spot instance terminations. If instances are being terminated, use another instance type or switch them to On-demand.

5. Disable the profiling processes you do not need in the Ataccama tool.
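
As a rough illustration of steps 2 and 3, the sketch below builds the SQL that materializes the joins into a single Delta table and then compacts it with OPTIMIZE, so the profiler only has to read one pre-joined table. The helper `build_profiling_prep_sql` and all table and column names are hypothetical and not part of Ataccama or Databricks. It only generates the statements, and assumes you would run them on a Databricks cluster where `spark` is the active SparkSession.

```python
def build_profiling_prep_sql(target_table, select_columns, base_table, joins):
    """Return two SQL statements: one that materializes the joined data
    as a Delta table, and one that runs OPTIMIZE on that table.

    joins is a list of (table, join_condition) tuples.
    All names passed in are placeholders chosen by the caller.
    """
    column_list = ", ".join(select_columns)
    join_clauses = "\n".join(
        f"JOIN {table} ON {condition}" for table, condition in joins
    )
    create_stmt = (
        f"CREATE OR REPLACE TABLE {target_table} USING DELTA AS\n"
        f"SELECT {column_list}\n"
        f"FROM {base_table}\n"
        f"{join_clauses}"
    )
    optimize_stmt = f"OPTIMIZE {target_table}"
    return [create_stmt, optimize_stmt]


# Hypothetical example tables; replace with your own.
statements = build_profiling_prep_sql(
    target_table="analytics.orders_for_profiling",
    select_columns=["o.order_id", "o.amount", "c.country"],
    base_table="sales.orders o",
    joins=[("sales.customers c", "o.customer_id = c.customer_id")],
)

for stmt in statements:
    print(stmt)
    # On a Databricks cluster you would execute each statement instead:
    # spark.sql(stmt)
```

Listing the columns explicitly, rather than using SELECT *, avoids duplicate column names from the join keys when the joined table is created. Once both statements have run, point the Ataccama profile at the joined table instead of performing the joins inside the profiling data flow.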
