Best Practices: 1 job per 1 target table
5 hours ago
We’re currently designing our Medallion Architecture pipelines using Lakeflow Jobs, and I wanted to get some opinions on orchestration best practices.
Right now, our approach is essentially 1 job per target table (for example, each Bronze/Silver/Gold table has its own dedicated Lakeflow job). The idea is to keep pipelines isolated, modular, and easier to troubleshoot.
However, I’m wondering about the long-term tradeoffs:
- Is this considered a good practice for scalability and maintainability?
- Could having a very large number of small jobs become inefficient in the future (job scheduling overhead, monitoring complexity, cost, etc.)?
- At what point does it make more sense to group multiple tables into a single workflow/job instead?
- How do teams usually balance modularity vs orchestration overhead in a Medallion Architecture setup?
Would love to hear how others structure their pipelines in production environments, especially for Databricks/Lakeflow-based architectures.
Labels: Workflows
4 hours ago
I am assuming you are talking about the jobs that load the bronze and silver tables. Having one job per table seems like a bad idea, since at scale you will most likely start hitting workspace limits, on top of the operational overhead of maintaining, monitoring, and deploying that many jobs, and the compute wasted across them. Typically you would use a metadata table or YAML file to define the configuration and then group your tables into different pipelines based on factors such as business domain, trigger, schedule, volume, and velocity (see the sketch below).
Gold tables would have their own pipelines if they have complex dependencies, but bronze and silver should be pretty straightforward metadata-driven pipelines/jobs.
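For what it's worth, here is a minimal sketch of that metadata-driven pattern, assuming a hypothetical control table `etl_control.table_config` with columns such as `pipeline_group`, `source_path`, `file_format`, `checkpoint_path`, and `target_table`; the names and the Auto Loader usage are illustrative, not a prescribed implementation:
```python
# Hypothetical metadata-driven bronze load: one job per pipeline group,
# iterating a control table instead of defining one job per target table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per source table; pipeline_group encodes the domain/schedule/volume grouping.
config_rows = (
    spark.read.table("etl_control.table_config")      # hypothetical control table
    .where("pipeline_group = 'sales_hourly'")          # group selected via a job parameter
    .collect()
)

for cfg in config_rows:
    query = (
        spark.readStream.format("cloudFiles")           # Auto Loader
        .option("cloudFiles.format", cfg.file_format)
        .load(cfg.source_path)
        .writeStream
        .option("checkpointLocation", cfg.checkpoint_path)
        .trigger(availableNow=True)                     # incremental batch run
        .toTable(cfg.target_table)
    )
    query.awaitTermination()                            # process each table's load in turn
```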
Pradeep Singh - https://www.linkedin.com/in/dbxdev
3 hours ago
We typically organize our workloads with one job per catalog, and then use one or more pipelines to load tables into the appropriate schemas. Since raw data ingestion is handled by our data engineers, this structure is primarily applied in the Silver and Gold layers of our architecture.
For example, when loading Salesforce data, we might structure it like this:
- salesforce_silver (job)
  - sales (schema) → Pipeline
    - Sales-related tables (as needed within the schema)
  - procurement (schema) → Pipeline
    - Procurement-related tables (as needed within the schema)
This same job-and-pipeline pattern is carried into the Gold layer. However, the structure often evolves there, since Gold datasets may combine data across multiple catalogs and schemas.
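To make the pattern a bit more concrete, a pipeline attached to the salesforce_silver job and targeting the sales schema might contain declarative table definitions along these lines; this is only a sketch using the Delta Live Tables / Lakeflow Declarative Pipelines `dlt` Python API, and all catalog, schema, table, and column names are placeholders:
```python
# Sketch of the "sales" pipeline inside the salesforce_silver job.
# Pipeline settings would point the default catalog/schema at salesforce_silver.sales;
# the source table and column names below are placeholders.
import dlt
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.getActiveSession()

@dlt.table(name="opportunities", comment="Cleaned Salesforce opportunities")
def opportunities():
    return (
        spark.read.table("salesforce_bronze.sales.opportunities_raw")
        .where(col("IsDeleted") == False)
        .select("Id", "AccountId", "StageName", "Amount", "CloseDate")
    )

@dlt.table(name="accounts", comment="Cleaned Salesforce accounts")
def accounts():
    return (
        spark.read.table("salesforce_bronze.sales.accounts_raw")
        .where(col("IsDeleted") == False)
    )
```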
Ultimately, your naming conventions and structure should reflect your specific design and use cases.
Larissa