<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Exploring parallelism for multiple tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/150633"&gt;@suja&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for one table. Tasks can:&lt;BR /&gt;1. Run in parallel&lt;BR /&gt;2. Be modular and reusable&lt;BR /&gt;3. Be monitored and retried independently&lt;BR /&gt;Each task (or task group) would handle the processing of one Hive table from Bronze → Silver → Gold.&lt;/P&gt;&lt;P&gt;Avoid using threads for Spark workloads. Python threads are not recommended here because:&lt;BR /&gt;Spark is already distributed, so driver-side threads add little.&lt;BR /&gt;Threads don’t provide true parallelism in Python (due to the GIL).&lt;BR /&gt;You lose visibility, fault tolerance, and scalability.&lt;/P&gt;&lt;P&gt;In short: use Databricks Workflows with parallel tasks, each processing one Hive table through Bronze → Silver → Gold and writing to the relational DB. Avoid threading; instead, modularize the processing via parameterized notebooks or scripts. Spark jobs scale better via job tasks than via threads.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Apr 2025 03:43:57 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2025-04-30T03:43:57Z</dc:date>
    <item>
      <title>Exploring parallelism for multiple tables</title>
      <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117068#M45419</link>
      <description>&lt;P&gt;I am new to Databricks. The app we need to build reads from Hive tables, goes through Bronze, Silver, and Gold layers, and stores the results in relational DB tables. There are multiple Hive tables with no dependencies between them. What is the best way to achieve parallelism? Should we use threads through each layer to process the multiple tables, run them as separate tasks in jobs, or is there another approach? What would be the most efficient implementation? Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 02:48:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117068#M45419</guid>
      <dc:creator>suja</dc:creator>
      <dc:date>2025-04-30T02:48:49Z</dc:date>
    </item>
    <item>
      <title>Re: Exploring parallelism for multiple tables</title>
      <link>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/150633"&gt;@suja&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for one table. Tasks can:&lt;BR /&gt;1. Run in parallel&lt;BR /&gt;2. Be modular and reusable&lt;BR /&gt;3. Be monitored and retried independently&lt;BR /&gt;Each task (or task group) would handle the processing of one Hive table from Bronze → Silver → Gold.&lt;/P&gt;&lt;P&gt;Avoid using threads for Spark workloads. Python threads are not recommended here because:&lt;BR /&gt;Spark is already distributed, so driver-side threads add little.&lt;BR /&gt;Threads don’t provide true parallelism in Python (due to the GIL).&lt;BR /&gt;You lose visibility, fault tolerance, and scalability.&lt;/P&gt;&lt;P&gt;In short: use Databricks Workflows with parallel tasks, each processing one Hive table through Bronze → Silver → Gold and writing to the relational DB. Avoid threading; instead, modularize the processing via parameterized notebooks or scripts. Spark jobs scale better via job tasks than via threads.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 03:43:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/exploring-parallelism-for-multiple-tables/m-p/117073#M45420</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-04-30T03:43:57Z</dc:date>
    </item>
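The reply's recommendation (one parallel Jobs task per Hive table, each running a parameterized notebook) can be sketched as a Jobs definition. This is a minimal illustration, not a complete pipeline: the notebook path, job name, task keys, and the table_name parameter are all assumed names for the sake of the example.

```yaml
# Sketch of a Databricks job (Jobs API 2.1 fields, shown as YAML) with
# independent tasks, one per Hive table. Paths and names are illustrative.
name: hive_tables_medallion
max_concurrent_runs: 1
tasks:
  - task_key: process_orders
    notebook_task:
      notebook_path: /Workspace/etl/bronze_silver_gold   # hypothetical shared notebook
      base_parameters:
        table_name: orders
  - task_key: process_customers
    notebook_task:
      notebook_path: /Workspace/etl/bronze_silver_gold
      base_parameters:
        table_name: customers
# Tasks with no depends_on entries have no ordering constraints, so the
# scheduler starts them together: each table's Bronze → Silver → Gold
# processing runs as its own parallel, independently retryable task.
```

Inside the shared notebook, the table to process can be read with `dbutils.widgets.get("table_name")`, which is how notebook task parameters are exposed on Databricks.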
  </channel>
</rss>

