Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Streamed DLT Pipeline using a lookup table

Mathias_Peters
Contributor II

Hi, 

I need to join three streams/streamed data sets in a DLT pipeline. I am reading a sequence of events per group key from a Kinesis data stream. The logically first event of each group contains a marker which determines whether that group is relevant for me or not. These events cannot be windowed reasonably since they can be weeks or even years apart. Conceptually: in order to process an event, I have to look up the first event of the group and check if the marker is set. Later on, I have to join the relevant events, on the group key, with events of a second and third stream.

Doing this in DLT streaming tables will probably be too slow, since row-based lookups are probably not supported (in my understanding).
Is there a best practice using a fast, maintainable lookup structure, such as a Postgres table holding just the group keys, where I can do a lookup per event?

Thank you in advance.

 

1 ACCEPTED SOLUTION

mark_ott
Databricks Employee

You can achieve efficient lookups for group markers in a Databricks DLT pipeline by storing just the relevant group keys (those where the marker is set) in a fast, separate lookup structure and joining your streams against this structure. While DLT streaming tables are not optimized for row-based, low-latency lookups—especially when historical events might span weeks or years—using a small, targeted dimension table or a database like Postgres for the lookup is a solid best practice for maintainability and performance.

Best Practice: Stream-Static Joins

  • Databricks supports "stream-static" joins, where you join incoming streaming events with a static or slowly-changing reference dataset (such as a filtered Delta table or external lookup DB) on the group key.

  • Only new events from your Kinesis stream are joined with the current (filtered) static table, rather than rejoining all prior history, which is efficient and scalable for your use case.

  • If the lookup table is small and mostly stable, you can cache it in memory or use broadcast joins to speed up the operation.
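A minimal sketch of such a stream-static join in a DLT pipeline might look like the following. The table names (`events_raw`, `relevant_group_keys`) and the join column (`group_key`) are placeholders for your own schema, not names from the original post:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Events filtered to relevant groups via a stream-static join")
def relevant_events():
    # Streaming side: the Kinesis-backed events table (hypothetical name).
    events = dlt.read_stream("events_raw")
    # Static side: a small Delta table of group keys whose marker was set.
    # It is re-read at the start of each micro-batch, so updates are picked up.
    lookup = spark.read.table("relevant_group_keys")
    # Broadcasting the small lookup table avoids a shuffle in the join.
    return events.join(F.broadcast(lookup), on="group_key", how="inner")
```

The inner join here doubles as the filter: events whose group key is absent from the lookup table are simply dropped.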

Options for Fast Lookup

  • Delta Table as Lookup: Periodically update a Delta table (partitioned and Z-ordered by group key for faster scanning/data skipping) with just the keys where the marker was set. This allows rapid join/filter on relevant group keys for your event streams.

  • External DB (e.g., Postgres): For very fast point lookups, you can maintain a compact table of group keys in Postgres. Your pipeline logic filters events by consulting Postgres for key relevance, which works well if the key set is small to moderate and low read latency is critical.

  • Materialized View/CDC: Keep an up-to-date materialized view or use Change Data Capture (CDC) logic to incrementally refresh the lookup set, minimizing the data scanned during each join.
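The first two options above could be sketched as follows. This is illustrative only: the table names, the boolean `marker` column, and the Postgres connection options are assumptions, and the JDBC credentials are placeholders to fill in (e.g., via a secret scope):

```python
import dlt
from pyspark.sql import functions as F

# Option 1: derive the lookup table inside the pipeline itself by filtering
# the marker events out of the raw stream (assumes a boolean "marker" column
# on the logically first event of each group).
@dlt.table(comment="Group keys whose first event had the marker set")
def relevant_group_keys():
    return (
        dlt.read_stream("events_raw")
        .where(F.col("marker") == True)
        .select("group_key")
    )

# Option 2: read a compact Postgres table of relevant keys as the static
# side of the join, using Spark's built-in JDBC source.
def postgres_lookup(spark):
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<db>")
        .option("dbtable", "relevant_group_keys")
        .option("user", "<user>")
        .option("password", "<password>")
        .load()
    )
```

With option 1, duplicate marker events for the same group key would multiply join output, so deduplicate downstream (or ensure the source emits the marker exactly once per group) before using the table as a join side.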

Maintainability Considerations

  • Whatever the storage choice (Delta or Postgres), keep the lookup table small (just group keys), update regularly, and ensure indexes/partitions match the group key.

  • For ease of integration and scaling within Databricks, Delta tables with incremental ('stream-static') joins are often the preferred route unless ultra-low latency/very high lookup volumes push you to an external store.

In summary, use a reference lookup table (Delta or Postgres) holding only relevant group keys for efficient event stream filtering and joining in your DLT pipeline. This approach is both maintainable and performant for your described scenario.
