
Streamed DLT Pipeline using a lookup table

Mathias_Peters
Contributor II

Hi, 

I need to join three streams/streamed data sets in a DLT pipeline. From a Kinesis data stream, I am reading a sequence of events per group key. The logically first event of each group contains a marker which determines whether that group is relevant for me or not. These events cannot be windowed reasonably, since they can be weeks or even years apart. Conceptually: in order to process an event, I have to look up the first event of its group and check if the marker is set. Later on, I have to join the relevant events on the group key with the events of a second and a third stream.

Doing this in DLT streaming tables will probably be too slow, since row-based lookups are probably not supported (in my understanding).
Is there a best practice using a fast, maintainable lookup structure, like a Postgres table holding just the group keys, where I can do a lookup per event?

Thank you in advance.

 

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

You can achieve efficient lookups for group markers in a Databricks DLT pipeline by storing just the relevant group keys (those where the marker is set) in a fast, separate lookup structure and joining your streams against this structure. While DLT streaming tables are not optimized for row-based, low-latency lookups (especially when historical events might span weeks or years), using a small, targeted dimension table or a database like Postgres for the lookup is a solid best practice for maintainability and performance.

Best Practice: Stream-Static Joins

  • Databricks supports "stream-static" joins, where you join incoming streaming events with a static or slowly-changing reference dataset (such as a filtered Delta table or external lookup DB) on the group key.

  • Only new events from your Kinesis stream are joined with the current (filtered) static table, rather than rejoining all prior history, which is efficient and scalable for your use case.

  • If the lookup table is small and mostly stable, you can cache it in memory or use broadcast joins to speed up the operation (a minimal sketch follows right after this list).
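A minimal sketch of such a stream-static join as a DLT table. The table and column names here (events_raw for the Kinesis-fed streaming table, relevant_group_keys for the lookup table, group_key for the join column) are illustrative assumptions, not names from your pipeline:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Events filtered to relevant groups via a stream-static join")
def filtered_events():
    events = dlt.read_stream("events_raw")     # streaming side
    keys = dlt.read("relevant_group_keys")     # static side, re-read on each pipeline update
    # The inner join drops every event whose group key has no marker entry;
    # broadcasting the small key table keeps the join cheap.
    return events.join(F.broadcast(keys), on="group_key", how="inner")
```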

Options for Fast Lookup

  • Delta Table as Lookup: Periodically update a Delta table (partitioned and Z-ordered by group key for faster scan/skipping) with just the keys where the marker was set. This allows rapid join/filter on relevant group keys for your event streams (a sketch follows this list).

  • External DB (e.g., Postgres): For very fast point lookups, you can maintain a compact table of group keys in Postgres. Your pipeline logic filters events by consulting Postgres for key relevance, which works well if the key set is small to moderate and low read latency is critical.

  • Materialized View/CDC: Keep an up-to-date materialized view or use Change Data Capture (CDC) logic to incrementally refresh the lookup set, minimizing the data scanned during each join.
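A sketch of the first option, with the lookup maintained as a DLT materialized view. It assumes a boolean marker column and illustrative names (event_source, group_key); adjust to your schema:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Lookup table holding only the relevant group keys")
def relevant_group_keys():
    # Batch read => materialized-view semantics: each pipeline update
    # refreshes the distinct set of keys whose marker event was seen.
    return (
        dlt.read("event_source")
           .where(F.col("marker"))
           .select("group_key")
           .distinct()
    )
```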

Maintainability Considerations

  • Whatever the storage choice (Delta or Postgres), keep the lookup table small (just group keys), update it regularly, and ensure indexes/partitions match the group key.

  • For ease of integration and scaling within Databricks, Delta tables with incremental ('stream-static') joins are often the preferred route, unless ultra-low latency or very high lookup volumes push you to an external store (a JDBC sketch follows below).
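If the external store is chosen, the static side of the join could be loaded from Postgres over JDBC. A minimal sketch with placeholder connection details; the host, database, table, and secret scope/key names are all assumptions:

```python
def load_relevant_keys(spark):
    # Load the compact Postgres key table as the static side of the join.
    # dbutils is available on Databricks; secret scope/key names are placeholders.
    return (
        spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/lookups")
             .option("dbtable", "relevant_group_keys")
             .option("user", dbutils.secrets.get("my_scope", "pg_user"))
             .option("password", dbutils.secrets.get("my_scope", "pg_password"))
             .load()
    )
```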

In summary, use a reference lookup table (Delta or Postgres) holding only relevant group keys for efficient event stream filtering and joining in your DLT pipeline. This approach is both maintainable and performant for your described scenario.


2 REPLIES


Mathias_Peters
Contributor II

Hi @mark_ott, thank you for your help.

I have a follow-up question regarding data completeness and out-of-order processing. I have decided to go with the Delta table option, since super-low latency is not an issue and since this option (seemingly) has the lowest maintenance effort.

I am wondering how to ensure data completeness in the following scenario: I have created the lookup table consent_ids using a DLT pipeline reading from another DLT table consent_source. Now, I want to implement the actual filter logic, which reads all the events from consent_source and uses a stream-static join with consent_ids.

The events stored in consent_source should be roughly in order, since they come from a Kinesis stream which is sharded by consent_id (meaning the events with the same consent_id should be ordered, while events with different consent_ids can be out of order w.r.t. event time).

The consent_ids stored in the lookup table are extracted from the very first event in a sequence of events per consent_id. How can I ensure that each consent_id is already stored in the lookup table when it is needed during the stream-static join? Since I have to store the final result of my flow in JSON files and not in tables, I assume one coordinated DLT pipeline is not feasible here. My plan is to wrap both parts into a Databricks job.
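A minimal sketch of what the second, non-DLT task of such a job could look like: drain the pipeline's output table and write it out as JSON files. The table name (filtered_events) and the /Volumes paths are placeholders, not names from this thread:

```python
# Hypothetical job task outside DLT: read the pipeline's output table as a
# stream and write JSON files; availableNow processes what's there, then stops.
query = (
    spark.readStream.table("catalog.schema.filtered_events")
         .writeStream
         .format("json")
         .option("path", "/Volumes/catalog/schema/landing/filtered_events")
         .option("checkpointLocation", "/Volumes/catalog/schema/chk/filtered_events")
         .trigger(availableNow=True)
         .start()
)
query.awaitTermination()
```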