1. Remove unneeded cache() and persist() calls from the DataFrame operations in the code base. Cached DataFrames pin executor memory, and a stale or forgotten cache often slows a job down instead of speeding it up.
2. Avoid driver operations like collect() and take() wherever possible - they ship rows from the executors back to the driver, which adds heavy network I/O and can exhaust driver memory. Keep aggregations on the cluster and bring back only small results.
3. Avoid manual partitioning for small and medium sized datasets - over-partitioning just produces many tiny files. On Databricks, enabling automatic liquid clustering lets the platform manage the data layout for you.
4. Avoid driver-based pandas code in a Spark codebase - pandas runs entirely on the driver, so you underutilize distributed processing. Convert to Spark DataFrames and take the full power of the executors.
5. Huge DAGs - persist intermediate steps to a temp table so the data is materialized. Writing an intermediate result out truncates the lineage, so Spark no longer has to plan and recompute the whole chain; the resulting tiny DAGs make the code easily debuggable and the runs smooth, with less overhead.
6. Avoid Photon where it isn't needed - it's actually not necessary for every query, and it is costly compared to a plain all-purpose cluster. SQL Warehouses and Serverless use the Photon engine by default, but for everyday notebook work on an all-purpose cluster you most probably don't need it.
7. Avoid DLT when a plain job will do - its pricing is high compared to running the same stream-stream join on all-purpose or job compute. Create the DLT pipeline, make it work, and once the data is producing the expected results, replicate the setup as a Job with a schedule. It does the same work far more cheaply, and you keep control over the checkpoints and failure handling.
Chanukya