I’m currently analyzing a large geospatial dataset focused on Michigan county boundaries and map data, and I’m using Apache Spark on Databricks to process and transform millions of records.
Even though I've already applied the basic optimizations, such as repartitioning, caching with cache(), and adjusting cluster size, my jobs still take a long time to complete, especially during wide transformations and joins across multiple data sources.
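For reference, here's a stripped-down sketch of what one of my slower jobs roughly looks like. The table paths, column names like county_fips, and the partition count are placeholders rather than my real schema, but the shape of the job is the same:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder inputs: a large record-level table and a small per-county lookup table.
records = spark.read.format("delta").load("/mnt/data/records")          # millions of rows
counties = spark.read.format("delta").load("/mnt/data/county_lookup")   # one row per county

# What I'm already doing: repartition on the join key and cache the big side.
records = records.repartition(200, "county_fips").cache()

# The wide stage that dominates runtime: join plus per-county aggregation.
summary = (
    records.join(counties, on="county_fips", how="inner")
           .groupBy("county_name")
           .agg(F.count("*").alias("record_count"))
)

summary.write.format("delta").mode("overwrite").save("/mnt/output/county_summary")
```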
What are the most effective techniques or configurations in Databricks to:
Improve job performance for large datasets
Handle shuffle operations more efficiently
Optimize joins and partitioning for geospatial or map-based data (a sketch of my current join pattern follows this list)
Reduce memory overhead or out-of-memory errors
Take advantage of Delta Lake features for faster queries
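To make the join and shuffle questions concrete: the county lookup is tiny compared to the main table, so I've experimented with broadcasting it and with enabling adaptive query execution's skew handling, but I'm not sure this is the right pattern for boundary-keyed data. Again, the names and paths below are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Let adaptive query execution split skewed partitions, in case a few
# counties hold most of the rows.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

records = spark.read.format("delta").load("/mnt/data/records")          # large side
counties = spark.read.format("delta").load("/mnt/data/county_lookup")   # small side

# Broadcast the small county table so the large table isn't shuffled for the join.
joined = records.join(broadcast(counties), on="county_fips", how="inner")
```

Is broadcasting plus AQE enough here, or should I be laying the data out by county up front?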
I'd also like to know whether there are real-world examples or tuning guides for handling map-style datasets (such as county-level data) efficiently.
For context, I’m working with a dataset similar to what’s publicly available on Michigan County Map, focusing on region-based insights and boundary-level processing.
https://michigancountymap.com/
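On the Delta Lake point above: since almost everything in this workload keys off the county, I've been wondering whether compacting and Z-ordering the tables on the county column is the right direction, along these lines (the path and column are placeholders, and I haven't actually run this yet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and cluster the data by the county key so that
# county-filtered queries can skip unrelated files.
spark.sql("OPTIMIZE delta.`/mnt/data/records` ZORDER BY (county_fips)")
```

Would that make a noticeable difference for boundary-level queries, or are there other Delta features I should look at first?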