
⭐ Set up Spark with Hadoop Anywhere: A DBR-aligned local Spark + HDFS + Hive stack on Docker ⭐

K_Anudeep
Databricks Employee

Hello Community,

Let me start off with a quick question:

Have you ever...

  • Migrated your workloads from on-prem Spark to Databricks, hit a bug, and thought, “I wish I could repro this locally to debug it without burning cluster hours”?

  • Seen a third-party library work fine on OSS Spark but fail on Databricks, with no easy way to debug?
  • Needed to test behaviour across different Spark / Scala / Java versions… but spinning up multiple clusters just for that felt like overkill?
  • Wanted to set up a Spark environment locally to learn and experiment with the latest versions of Spark?

If yes, this project is literally built for you. 

------------------------------------------------------------------------------------------

Meet Spark with Hadoop Anywhere

I’ve open-sourced a project called Spark with Hadoop Anywhere – a production-like Spark + Hadoop + Hive stack on Docker, specifically designed for:

  • On-prem open-source Spark and Databricks users (Support, SREs, Data Engineers, etc.)

  • People debugging issues with DBR vs. OSS Spark
  • Anyone who wants a realistic local analytics node that runs anywhere Docker is available
  • Anyone who wants to learn Spark with a cluster-like setup installed locally

🔗 Project page (docs + overview):
https://anudeepkonaboina.github.io/spark-with-hadoop-anywhere/

🔗 GitHub repo (please star and fork!):
https://github.com/AnudeepKonaboina/spark-with-hadoop-anywhere

 

Key features (why you should at least fork it 👇)

1. DBR-aligned version matrix

Each Git branch is pinned to a specific Spark / Scala / Java combination, aligned with the OSS Spark versions used by Databricks Runtime (DBR).

More Details here: https://anudeepkonaboina.github.io/spark-with-hadoop-anywhere/#dbr-underlying-spark-oss-compatible-b...
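
To confirm that a local stack really sits on the release line you are targeting, something like the sketch below may help. It is not taken from the project: `parse_version`, `same_release_line`, and `runtime_versions` are hypothetical helper names, and `runtime_versions` assumes a classic (non-Connect) PySpark session in which `spark.sparkContext._jvm` is available.

```python
import re
from typing import Dict, Tuple

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn '3.5.0' or '3.5.0-SNAPSHOT' into (3, 5, 0); a missing patch becomes 0."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def same_release_line(local: str, target: str) -> bool:
    """True when major and minor versions match, e.g. 3.5.0 vs 3.5.1."""
    return parse_version(local)[:2] == parse_version(target)[:2]


def runtime_versions(spark) -> Dict[str, str]:
    """Collect Spark, Scala, and Java versions from a running session.

    Assumption: `spark` is a classic PySpark SparkSession, so the JVM
    gateway is reachable through spark.sparkContext._jvm.
    """
    jvm = spark.sparkContext._jvm
    return {
        "spark": spark.version,
        "scala": jvm.scala.util.Properties.versionNumberString(),
        "java": jvm.java.lang.System.getProperty("java.version"),
    }
```

Inside a PySpark shell on the local stack, `same_release_line(runtime_versions(spark)["spark"], "3.5.0")` would then tell you whether the branch you checked out matches the Spark line you care about. The target version string here is only an example value.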

2. Full analytics node, not just Spark

You don’t just get a Spark binary thrown in a container.

You get a single-node analytics stack:

  • Spark (standalone: master + worker in one container)

  • HDFS (namenode + datanode, real filesystem semantics)
  • Hive Metastore backed by PostgreSQL in a separate container
  • Hive CLI & Beeline wired up
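
To see how these pieces fit together from a client's point of view, here is a minimal PySpark smoke-test sketch. It is an illustration under assumptions, not the project's own code: the helper names are hypothetical, and the host:port values are placeholders that you would replace with whatever your checkout of the stack actually exposes. If you run PySpark inside the Spark container, the session is probably preconfigured and the extra settings may be unnecessary.

```python
from typing import Dict


def local_stack_conf(namenode: str = "localhost:9000",
                     metastore: str = "localhost:9083") -> Dict[str, str]:
    """Spark settings that point a session at HDFS and a Hive Metastore.

    Assumption: both endpoints are placeholders; check the project's
    docs for the real hostnames and ports.
    """
    return {
        "spark.hadoop.fs.defaultFS": f"hdfs://{namenode}",
        "spark.hadoop.hive.metastore.uris": f"thrift://{metastore}",
        "spark.sql.catalogImplementation": "hive",
    }


def build_session(conf: Dict[str, str], app_name: str = "stack-smoke-test"):
    """Create a Hive-enabled SparkSession from the given settings."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark installed

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def smoke_test(spark, table: str = "default.stack_smoke_test") -> int:
    """Write a small managed table through the metastore and read it back."""
    spark.range(5).write.mode("overwrite").saveAsTable(table)
    return spark.table(table).count()
```

Calling `smoke_test(build_session(local_stack_conf()))` should return 5 when the session can reach both the filesystem and the metastore. The same table should then also be visible from Beeline or the Hive CLI, which makes it a quick end-to-end check.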

3. One-command setup via setup-spark.sh

The entire stack is orchestrated by a single script: run one command and you have a single-node, cluster-like setup on your laptop.
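
Once the script has finished, it can be handy to verify that the services are actually listening before you start a repro. The sketch below is a generic readiness check, not something shipped with the project. The port numbers are the upstream defaults for Spark standalone (7077 and 8080), the Hadoop 3 NameNode UI (9870), and the Hive Metastore (9083); whether the stack publishes these ports to the host is an assumption you should check against the docs.

```python
import socket
import time
from typing import Dict, Iterable

# Assumption: upstream default ports; the project may map them differently.
DEFAULT_PORTS = {
    "spark-master": 7077,
    "spark-master-ui": 8080,
    "hdfs-namenode-ui": 9870,
    "hive-metastore": 9083,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host: str, ports: Iterable[int],
                   deadline_s: float = 60.0,
                   interval_s: float = 1.0) -> Dict[int, bool]:
    """Poll until every port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    status = {port: False for port in ports}
    while True:
        for port in status:
            if not status[port]:
                status[port] = port_open(host, port)
        if all(status.values()) or time.monotonic() >= end:
            return status
        time.sleep(interval_s)
```

For example, `wait_for_ports("localhost", DEFAULT_PORTS.values())` returns a dict that maps each port to whether it became reachable within the deadline.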

Why I’m posting this on Databricks Community

This project was born out of exactly the kind of pain that users working with OSS Spark or Databricks face:

  • Support / SRE / PS engineers needing fast, realistic repros

  • Customers hitting weird corner cases in Spark / Delta / connectors
  • Devs who want DBR-aligned OSS Spark locally without owning infrastructure

If that sounds like you, I’d honestly love it if you:

  1. Fork the repo

  2. Spin up the stack for the Spark version you care about
  3. Try reproducing one of your current/old issues
  4. Share feedback, issues, or PRs

---------------------------------

Interested?

👉 Fork and explore it, and star the repo if you like it:
https://github.com/AnudeepKonaboina/spark-with-hadoop-anywhere

👉 Docs/overview (easier to share inside your team):
https://anudeepkonaboina.github.io/spark-with-hadoop-anywhere/

If you end up using this to debug a challenging Spark/HDFS/Hive or DBR issue, please leave a comment or open an issue in the repository – I’d love to hear about your experience and what would make the stack even more useful for the Databricks community.

Anudeep
4 REPLIES

KaushalVachhani
Databricks Employee

This is amazing, @K_Anudeep. Users will benefit from this.

JAHNAVI
Databricks Employee

Fantastic, @K_Anudeep! This is truly amazing. It will help a lot, and it provides such a lightweight environment.

Jahnavi N

nikhilj0421
Databricks Employee

This is fantastic @K_Anudeep, and really helpful. 

RevanthV
New Contributor III

Hey, I tried this out on my laptop, and it takes barely 3 minutes to set up a cluster-like env locally. This is really helpful. Great work, @K_Anudeep 👏👏