What are some best practices for optimizing Spark jobs in Databricks, especially when dealing with large datasets? Any tips or resources would be greatly appreciated! I'm trying to analyze data on restaurant menu prices, so any insights there would be especially helpful!
There are so many. Here are a few (a small sketch follows below):
- look for data skew
- shuffle as little as possible
- avoid many small files
- use Spark APIs rather than pure Python
- if using an autoscaling cluster, check that you don't lose a lot of time scaling up and down
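A minimal PySpark sketch of a few of these points. The table paths and column names (`menu_prices`, `restaurant_id`, etc.) are hypothetical placeholders for the poster's menu-price data, and the partition count is just an illustrative value.

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` is predefined; getOrCreate() keeps this runnable elsewhere.
spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/mnt/data/menu_prices")          # hypothetical path

# Data skew: check row counts per join/group key before an expensive join or groupBy
df.groupBy("restaurant_id").count().orderBy(F.desc("count")).show(10)

# Fewer shuffles: broadcast the small side of a join instead of a shuffle join
restaurants = spark.read.format("delta").load("/mnt/data/restaurants")  # hypothetical path
joined = df.join(F.broadcast(restaurants), "restaurant_id")

# Avoid many small files: compact the output into a sensible number of partitions
joined.repartition(64).write.format("delta").mode("overwrite").save("/mnt/out/menu_joined")
```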
Optimizing Spark jobs is all about smart data strategies: minimize shuffles, tune partition counts, cache only what truly matters, and choose the right file format to keep workloads efficient and cost-effective. It's a bit like checking the Wetherspoons kids menu before ordering: a little planning ahead and everything runs smoother, faster, and without unnecessary delays.
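To make a couple of those points concrete, here is a hedged sketch of shuffle-partition tuning and selective caching; the path and the partition setting are illustrative assumptions, not recommendations for any particular cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tune shuffle partitions to the data volume and cluster size (200 is just Spark's default)
spark.conf.set("spark.sql.shuffle.partitions", "200")

prices = spark.read.format("delta").load("/mnt/data/menu_prices")  # hypothetical path

# Cache only data that is reused several times, and release it when done
hot = prices.filter(F.col("city") == "London").cache()
hot.count()        # materializes the cache
hot.groupBy("restaurant_id").agg(F.avg("price")).show()
hot.unpersist()
```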
In addition to the cool comments above, try to use clusters whose VM types support disk caching. Disk caching stores data at the Parquet file level on the VMs' local storage, which is a great complement to Spark caching.
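As a rough sketch, the disk cache can also be enabled from a notebook with the `spark.databricks.io.cache.enabled` setting (assuming an instance type with local SSDs; whether it is on by default depends on the instance family). The path below is again a hypothetical placeholder.

```python
# Enable the Databricks disk (IO) cache for Parquet/Delta scans on this cluster
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# The disk cache works at the Parquet file level on local storage, complementing
# df.cache()/persist(), which keeps deserialized rows in executor memory.
df = spark.read.format("delta").load("/mnt/data/menu_prices")  # hypothetical path
df.count()  # the first scan populates the local cache; subsequent scans read from it
```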