Hi @suja
Use Databricks Workflows (Jobs) with Task Parallelism
Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for a table. Tasks can:
1. Run in parallel
2. Be modular and reusable
3. Be monitored and retried independently
Each task (or task group) would represent processing for one Hive table from Bronze → Silver → Gold, for example as sketched below.
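A minimal sketch of such a job using the Python databricks-sdk is shown here; the table names, notebook path, and cluster id are placeholders, and exact SDK field names can differ slightly between versions:

```python
# Sketch: one Databricks job with one task per Hive table, all running in parallel
# (tasks without depends_on run concurrently). Placeholders: notebook path, cluster id, tables.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

tables = ["customers", "orders", "payments"]  # one task per Hive table

w.jobs.create(
    name="bronze_silver_gold_pipeline",
    tasks=[
        jobs.Task(
            task_key=f"process_{t}",
            existing_cluster_id="<cluster-id>",          # or a job cluster
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/etl/process_table",  # parameterized notebook
                base_parameters={"table_name": t},
            ),
        )
        for t in tables
    ],
)
```

Each task is then visible, retryable, and schedulable on its own in the Workflows UI.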
Avoid Using Threads for Spark Workloads
Using Python threads for Spark workloads is not recommended, because:
Spark is already distributed.
Threads don't provide real parallelism in Python (due to the GIL).
You lose visibility, fault tolerance, and scalability.
Use Databricks Workflows with parallel tasks, each processing one Hive table through Bronze → Silver → Gold and writing to the relational database. Avoid threading; instead, modularize the processing into parameterized notebooks or scripts, along the lines of the sketch below.
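A rough sketch of such a parameterized notebook follows; the schema names, JDBC connection details, secret scope, and transformation logic are placeholders you would replace with your own:

```python
# Sketch of a parameterized Databricks notebook that processes one table Bronze -> Silver -> Gold.
# The table name arrives as a job/task parameter; spark and dbutils are notebook globals.
table_name = dbutils.widgets.get("table_name")

# Bronze: read the raw Hive table as-is
bronze_df = spark.read.table(f"bronze.{table_name}")

# Silver: basic cleanup (placeholder logic: deduplicate, drop fully-null rows)
silver_df = bronze_df.dropDuplicates().dropna(how="all")
silver_df.write.mode("overwrite").saveAsTable(f"silver.{table_name}")

# Gold: business-level transformation (placeholder: pass-through)
gold_df = spark.read.table(f"silver.{table_name}")
gold_df.write.mode("overwrite").saveAsTable(f"gold.{table_name}")

# Push the Gold table to the relational database over JDBC (placeholder connection)
(gold_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<db>")
    .option("dbtable", table_name)
    .option("user", dbutils.secrets.get("etl-scope", "db_user"))
    .option("password", dbutils.secrets.get("etl-scope", "db_password"))
    .mode("overwrite")
    .save())
```

The same notebook is reused by every task; only the `table_name` parameter changes.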
Spark workloads scale better through job tasks than through threads.
LR