We're currently designing our Medallion Architecture pipelines using Lakeflow Jobs, and I wanted to get some opinions on orchestration best practices.
Right now, our approach is essentially one job per target table: each Bronze, Silver, and Gold table has its own dedicated Lakeflow job. The idea is to keep pipelines isolated, modular, and easy to troubleshoot.
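To make the question concrete, here's roughly what that pattern looks like if expressed with the Databricks SDK for Python (table and notebook names are made up for illustration, and compute config is omitted, assuming serverless):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One dedicated single-task job per target table.
# Table and notebook paths below are hypothetical examples.
for table in ["bronze_orders", "silver_orders", "gold_orders_daily"]:
    w.jobs.create(
        name=f"load_{table}",
        tasks=[
            jobs.Task(
                task_key=table,
                notebook_task=jobs.NotebookTask(
                    notebook_path=f"/pipelines/{table}"
                ),
            )
        ],
    )
```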
However, I'm wondering about the long-term tradeoffs:
- Is this considered a good practice for scalability and maintainability?
- Could having a very large number of small jobs become inefficient in the future (job scheduling overhead, monitoring complexity, cost, etc.)?
- At what point does it make more sense to group multiple tables into a single multi-task workflow/job instead (see the sketch after this list for what I mean by grouping)?
- How do teams usually balance modularity vs orchestration overhead in a Medallion Architecture setup?
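For the grouping question, here's a rough sketch of the alternative we're considering: collapsing one table chain into a single multi-task job, with `depends_on` wiring Bronze into Silver into Gold. Same SDK, same hypothetical names and omitted compute config as above:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One job orchestrating a Bronze -> Silver -> Gold chain for a single domain.
# Task keys and notebook paths are illustrative only.
w.jobs.create(
    name="orders_medallion",
    tasks=[
        jobs.Task(
            task_key="bronze_orders",
            notebook_task=jobs.NotebookTask(notebook_path="/pipelines/bronze_orders"),
        ),
        jobs.Task(
            task_key="silver_orders",
            depends_on=[jobs.TaskDependency(task_key="bronze_orders")],
            notebook_task=jobs.NotebookTask(notebook_path="/pipelines/silver_orders"),
        ),
        jobs.Task(
            task_key="gold_orders_daily",
            depends_on=[jobs.TaskDependency(task_key="silver_orders")],
            notebook_task=jobs.NotebookTask(notebook_path="/pipelines/gold_orders_daily"),
        ),
    ],
)
```

The appeal of this shape is one run, one schedule, and one place to see failures per domain, at the cost of the per-table isolation we have today.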
Would love to hear how others structure their pipelines in production environments, especially for Databricks/Lakeflow-based architectures.