
DLT to push data instead of a pull

_databreaks
New Contributor II

I am relatively new to Databricks, and from my recent experience it appears that at every step in a DLT pipeline, we define each LIVE table (streaming or otherwise) to pull data from upstream.

I have yet to see an implementation where upstream data is pushed downstream; say, where I could create a bronze table and configure, in its definition, the silver tables it pushes its data into.

This would be especially useful, I think, when ingesting data from Kafka, where different topics carry different payload (message) schemas and I would like to segregate the messages by topic, that is, to land each topic in its own table.

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @_databreaks, you’re absolutely right!

While the typical approach in Databricks involves pulling data from upstream sources into downstream tables, there are scenarios where a push-based architecture could be beneficial. 

  1. Pull-Based Architecture (Typical Approach):

    • In a pull-based architecture, downstream tables (e.g., silver or gold tables) actively query and pull data from upstream tables (e.g., bronze tables).
    • This approach is common because it allows for flexibility in processing, transformations, and filtering at the downstream stage.
    • It works well when you need to apply complex business logic or aggregations on the data before storing it in downstream tables.
  2. Push-Based Architecture (Alternative Approach):

    • In a push-based architecture, upstream tables (e.g., bronze tables) actively push data to downstream tables (e.g., silver or gold tables).
    • This approach can be useful in specific scenarios:
      • Schema Segregation: As you mentioned, when ingesting from Kafka where topics carry different payload schemas, you can create a separate downstream table per topic, with each topic’s data landing directly in its corresponding table (see the first sketch after this list).
      • Reduced Latency: Pushing data downstream can reduce latency, since records are forwarded as soon as they arrive rather than waiting for the next downstream refresh to pick them up.
      • Simplified ETL Logic: If the transformations needed for downstream tables are straightforward (e.g., filtering, renaming columns), a push-based approach simplifies the ETL logic.
      • Event-Driven Processing: Push-based architectures align well with event-driven processing, where data availability triggers downstream processing.
  3. Implementation Considerations:

    • To implement a push-based architecture:
      • Define your downstream tables (e.g., silver tables) with appropriate schemas.
      • Configure your upstream sources (e.g., Kafka) to push data directly to the corresponding downstream tables.
      • Ensure that data consistency and error handling mechanisms are in place.
      • Monitor and manage the flow of data to prevent bottlenecks or data loss.
  4. Hybrid Approaches:

    • In practice, hybrid approaches are often used. For example:
      • Initial data ingestion may be pull-based (e.g., from Kafka to a bronze table).
      • Subsequent processing stages (e.g., filtering, aggregations) can be push-flavored (from bronze into silver tables); the append-flow sketch at the end of this reply is the closest construct DLT offers here today.
      • This allows flexibility while optimizing for performance and simplicity.
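
To make the pull model in point 1 and the topic segregation in point 2 concrete, here is a minimal sketch of how this is typically written in DLT today. It is pull-based throughout, and the broker address, topic names, and table names are placeholders I made up for illustration, not anything from your pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: a single streaming ingest of all topics. The broker address and
# topic list are placeholders; `spark` is provided by the pipeline runtime.
@dlt.table(name="bronze_kafka_raw")
def bronze_kafka_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("subscribe", "orders,payments")            # placeholder topics
        .load()
        .select(
            col("topic"),
            col("value").cast("string").alias("payload"),
            col("timestamp"),
        )
    )

# Silver: one table per topic. Each table pulls from bronze and filters on
# the Kafka topic column, which is the pull-based way to route each topic
# into its own table.
def make_topic_table(topic_name: str):
    @dlt.table(name=f"silver_{topic_name}")
    def silver_topic():
        return dlt.read_stream("bronze_kafka_raw").where(col("topic") == topic_name)

for t in ["orders", "payments"]:
    make_topic_table(t)
```

Note that even though each silver table is declared as a pull, the loop means you write the routing once; adding a topic is a one-line change.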
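Separately, if your DLT runtime supports the append flow API, that is the closest thing to a push that DLT offers today: you declare the target table once, and each source-side flow appends (pushes) its rows into it, so the routing lives with the source rather than the target. A hedged sketch, reusing the placeholder names from the sketch above:

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that several flows append into. Declared once,
# with no reference to its sources.
dlt.create_streaming_table("silver_all_events")

# Each append flow pushes its rows into the shared target, so the routing
# is declared on the source side. The bronze table and topic names are the
# placeholders from the previous sketch.
@dlt.append_flow(target="silver_all_events", name="orders_flow")
def orders_flow():
    return dlt.read_stream("bronze_kafka_raw").where(col("topic") == "orders")

@dlt.append_flow(target="silver_all_events", name="payments_flow")
def payments_flow():
    return dlt.read_stream("bronze_kafka_raw").where(col("topic") == "payments")
```

This inverts the usual direction: the silver table no longer knows where its data comes from, and new sources can be added without touching its definition.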