
DLT to push data instead of a pull

_databreaks
New Contributor II

I am relatively new to Databricks, and from my recent experience it appears that at every step in a DLT pipeline, we define each LIVE table (streaming or otherwise) to pull data from upstream.

I have yet to see an implementation where upstream data is pushed downstream; say, where I could create a bronze table and configure, in its definition, the silver tables it pushes its data into.

This would be especially useful, I think, when ingesting data from Kafka, where different topics carry different payload (message) schemas and I would like to segregate the messages by topic, that is, to land each topic in its own table.

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @_databreaks, you’re absolutely right!

While the typical approach in Databricks involves pulling data from upstream sources into downstream tables, there are scenarios where a push-based architecture could be beneficial. 

  1. Pull-Based Architecture (Typical Approach):

    • In a pull-based architecture, downstream tables (e.g., silver or gold tables) actively query and pull data from upstream tables (e.g., bronze tables).
    • This approach is common because it allows for flexibility in processing, transformations, and filtering at the downstream stage.
    • It works well when you need to apply complex business logic or aggregations on the data before storing it in downstream tables.
  2. Push-Based Architecture (Alternative Approach):

    • In a push-based architecture, upstream tables (e.g., bronze tables) actively push data to downstream tables (e.g., silver or gold tables).
    • This approach can be useful in specific scenarios:
      • Schema Segregation: As you mentioned, when ingesting from Kafka where topics carry different payload schemas, you can create a separate downstream table per topic, with each topic’s data landing directly in its corresponding table (see the first sketch after this list).
      • Reduced Latency: Pushing data downstream can reduce latency, since records are forwarded as soon as they arrive rather than waiting for the next downstream refresh to pick them up.
      • Simplified ETL Logic: If the transformations needed for downstream tables are straightforward (e.g., filtering, renaming columns), a push-based approach simplifies the ETL logic.
      • Event-Driven Processing: Push-based architectures align well with event-driven processing, where data availability triggers downstream processing.
  3. Implementation Considerations:

    • To implement a push-based architecture:
      • Define your downstream tables (e.g., silver tables) with appropriate schemas.
      • Configure your upstream sources (e.g., Kafka) to push data directly to the corresponding downstream tables.
      • Ensure that data consistency and error handling mechanisms are in place.
      • Monitor and manage the flow of data to prevent bottlenecks or data loss.
  4. Hybrid Approaches:

    • In practice, hybrid approaches are often used. For example:
      • Initial data ingestion may be pull-based (e.g., from Kafka to a bronze table).
      • Subsequent processing stages (e.g., filtering, aggregations) can be push-flavored (from bronze into silver tables); the append-flow sketch at the end of this reply is the closest construct DLT offers here today.
      • This allows flexibility while optimizing for performance and simplicity.
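
To make the pull model in point 1 and the topic segregation in point 2 concrete, here is a minimal sketch of how this is typically written in DLT today. It is pull-based throughout, and the broker address, topic names, and table names are placeholders I made up for illustration, not anything from your pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: a single streaming ingest of all topics. The broker address and
# topic list are placeholders; `spark` is provided by the pipeline runtime.
@dlt.table(name="bronze_kafka_raw")
def bronze_kafka_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("subscribe", "orders,payments")            # placeholder topics
        .load()
        .select(
            col("topic"),
            col("value").cast("string").alias("payload"),
            col("timestamp"),
        )
    )

# Silver: one table per topic. Each table pulls from bronze and filters on
# the Kafka topic column, which is the pull-based way to route each topic
# into its own table.
def make_topic_table(topic_name: str):
    @dlt.table(name=f"silver_{topic_name}")
    def silver_topic():
        return dlt.read_stream("bronze_kafka_raw").where(col("topic") == topic_name)

for t in ["orders", "payments"]:
    make_topic_table(t)
```

Note that even though each silver table is declared as a pull, the loop means you write the routing once; adding a topic is a one-line change.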
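Separately, if your DLT runtime supports the append flow API, that is the closest thing to a push that DLT offers today: you declare the target table once, and each source-side flow appends (pushes) its rows into it, so the routing lives with the source rather than the target. A hedged sketch, reusing the placeholder names from the sketch above:

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that several flows append into. Declared once,
# with no reference to its sources.
dlt.create_streaming_table("silver_all_events")

# Each append flow pushes its rows into the shared target, so the routing
# is declared on the source side. The bronze table and topic names are the
# placeholders from the previous sketch.
@dlt.append_flow(target="silver_all_events", name="orders_flow")
def orders_flow():
    return dlt.read_stream("bronze_kafka_raw").where(col("topic") == "orders")

@dlt.append_flow(target="silver_all_events", name="payments_flow")
def payments_flow():
    return dlt.read_stream("bronze_kafka_raw").where(col("topic") == "payments")
```

This inverts the usual direction: the silver table no longer knows where its data comes from, and new sources can be added without touching its definition.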