Ataccama profiling failing on Databricks

If you're trying to run the Ataccama profiler on tables that involve multiple joins or are very large, note that some of the profiler's processes can perform poorly.

If your jobs are crashing or running for several hours, check the following:

1. Running aggregation profiles can group data into single partitions, which shuffles a lot of data across the worker nodes. Operations such as groupByKey and sortByKey are costly and are not optimized in the Ataccama tool. If the Stages tab of the Spark UI shows that too much data is being shuffled, increase the worker memory size.

2. Run OPTIMIZE on your Delta tables that are being profiled.

3. If the profiling process runs multiple joins, join the tables first, outside the profiling data flow. Then run OPTIMIZE on the resulting Delta table and run the profile on that joined table.

4. Check the cluster "Event Log" page for spot instance terminations. If instances are being terminated, use another instance type or switch them to on-demand.

5. Disable some of the profiling processes in the Ataccama tool to reduce the amount of work per run.
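
Steps 2 and 3 could be scripted roughly as follows. This is only a sketch under assumptions: it expects to run in a Databricks notebook where a `spark` session already exists, the helper names (`optimize_statement`, `prejoin_statement`, `prepare_for_profiling`) are made up for illustration, and the optional `ZORDER BY` columns are an extra that the steps above do not require. The helpers only build and submit SQL; nothing here is part of the Ataccama tool.

```python
from typing import List, Optional


def optimize_statement(table: str, zorder_by: Optional[List[str]] = None) -> str:
    """Build an OPTIMIZE statement for a Delta table (step 2)."""
    stmt = f"OPTIMIZE {table}"
    if zorder_by:
        # ZORDER BY is optional and is an assumption, not part of the original advice.
        stmt += f" ZORDER BY ({', '.join(zorder_by)})"
    return stmt


def prejoin_statement(target: str, select_sql: str) -> str:
    """Build a statement that materializes a join as a Delta table (step 3)."""
    return f"CREATE OR REPLACE TABLE {target} USING DELTA AS {select_sql}"


def prepare_for_profiling(
    spark,
    target: str,
    select_sql: str,
    zorder_by: Optional[List[str]] = None,
) -> List[str]:
    """Materialize the join, then compact the result, before profiling it."""
    statements = [
        prejoin_statement(target, select_sql),
        optimize_statement(target, zorder_by),
    ]
    for stmt in statements:
        # `spark` only needs a .sql(str) method, as a SparkSession has.
        spark.sql(stmt)
    return statements
```

In a notebook this might be called as `prepare_for_profiling(spark, "profiling.orders_joined", "SELECT o.*, c.segment FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id")`, where the table and column names are hypothetical. The profiler would then be pointed at the single joined table instead of running the joins itself.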
