
Best practices for optimizing Spark jobs

chris0991 — Tue, 01 Oct 2024 06:46:50 GMT

What are some best practices for optimizing Spark jobs in Databricks, especially when dealing with large datasets? Any tips or resources would be greatly appreciated! I’m trying to analyze data on restaurant menu prices, so insights on that kind of workload would be especially helpful!

Re: Best practices for optimizing Spark jobs

-werners- — Tue, 01 Oct 2024 10:12:07 GMT

There are so many.
Here are a few:
- look for data skew (a few keys or partitions holding most of the data)
- shuffle as little as possible
- avoid many small files
- use Spark's own APIs and not only pure Python
- if using an autoscaling cluster: check that you aren't losing a lot of time scaling up and down
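
For the first point, here is a rough sketch of one way to put a number on skew. It is plain Python working on per-key row counts, so it only illustrates the idea; on a real job you would get those counts from Spark itself, for example with `df.groupBy("key").count()`. The column name `key`, the helper `skew_ratio`, and the sample data are all made up for illustration.

```python
from collections import Counter
from statistics import median


def skew_ratio(key_counts):
    """Return the largest per-key count divided by the median count.

    A value close to 1 means the keys are evenly spread; a value far
    above 1 hints that one key dominates and may cause a straggler task.
    """
    counts = list(key_counts.values())
    if not counts:
        raise ValueError("key_counts is empty")
    return max(counts) / median(counts)


# Hypothetical sample of join keys: "a" holds 90 of the 100 rows.
sample_keys = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
print(skew_ratio(Counter(sample_keys)))
```

What counts as "too skewed" depends on the job, so treat the ratio as a prompt to look closer at the heavy keys rather than a hard threshold.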

Re: Best practices for optimizing Spark jobs

szymon_dybczak — Tue, 15 Jul 2025 09:12:13 GMT

Good one @john34567, it made me chuckle, but this is still spam 😄

Re: Best practices for optimizing Spark jobs

Nohashah — Thu, 20 Nov 2025 23:33:47 GMT

Optimizing Spark jobs is all about using smart data strategies: minimizing shuffles, tuning partitions, caching only what truly matters, and choosing the right file format to keep workloads efficient and cost-effective. It reminds me of how planning ahead works, just like checking the Wetherspoons kids menu before ordering so everything runs smoother, faster, and without unnecessary delays.

Re: Best practices for optimizing Spark jobs

Coffee77 — Fri, 21 Nov 2025 08:56:15 GMT

In addition to the great comments above, try to use clusters whose VMs have disk caching enabled. This caches data at the Parquet file level on the VM's local storage, which makes it a good complement to Spark caching.
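
If the cluster's instance type does not turn disk caching on by default, it can usually be switched on per session. A minimal sketch, assuming a Databricks notebook where `spark` is the preconfigured session; the helper name `enable_disk_cache` is just for illustration, while `spark.databricks.io.cache.enabled` is the setting named in the Databricks docs.

```python
def enable_disk_cache(spark):
    """Turn on the Databricks disk cache for this session.

    `spark` is expected to be a SparkSession (or anything exposing the
    same `conf.set` / `conf.get` methods). Returns the value now stored.
    """
    spark.conf.set("spark.databricks.io.cache.enabled", "true")
    return spark.conf.get("spark.databricks.io.cache.enabled")
```

In a notebook this would be called as `enable_disk_cache(spark)`. Whether it helps depends on the worker type having local SSD storage, so it is worth checking the cluster configuration first.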