Databricks Community

Erik · ‎10-01-2023

We have a setup where we process sensor data in databricks using pyspark structured streaming from kafka streams, and continuisly write these to delta tables. These delta tables are served through a SQL warehouse endpoint to the users. We also store all the messages on long-term storrage, so we can backfill and recompute if needed. The cool thing about this is that we use the exact same code for both the hot-path and the cold/backfill-path, and there is a single endpoint (the SQL warehouse) the users use to get both old and fresh data. The bad thing is that the hot path is... not so hot. It is clear that its not really optiomal to continuously write tiny new amount of data to the delta tables, and writing ends up taking around 5-10 seconds (even if computing is sub-second). Not to mention the trillion of tiny files we end up with (but those can be handled with optimize). So the end-to-end latency is around 10 seconds, where the majority of this comes from just writing to the delta tables.

What I want is a solution where I can keep doing the transformations (both hot and cold) in databricks, but get down the end-to-end latency (kafka to end-user), and prefferably keep having a single endpoint the user can query to get both old and new data. And for that I want your help!

So far I see some possible alternatives. All of them involve running a separate structured streaming job which pushes data somewhere:

1. Maybe it is possible to create some in-memory table in the SQL warehouse where I continuously push new data (while removing old), and then the end-user can access a view which is the union of this table and the delta-backed permanent store. I have investigated this a bit, and since global temp tables are dissabled in SQL warehouse I am uncertain if this can be done?

2a. I can push the data to a postgresql database with timescaledb installed, and then I can give the SQL warehouse access to it by query federation. The end-user can access a view which is the union of the postgresql database and the delta-backed permanent storage. The issue is that window functions is not pushed down to postgresql, so I am quite uncertain what happens when the user asks for a resampling on the view which is the union. Does it fetch everything from postgresql? How will that work in practice? Or does it just crash? ( I can't test since we dont have unity catalog yet, but that will come).

2b. As 2a, but I put some server in front of postgresql and SQL warehouse which handles spreading the query across. This might be what I have to end up with, but is is the least desirable since it involves writing custom bespoke code for all the possible queries we might want to do across hot and lukewarm data. Maybe there exists some library somewhere I have not heard about which can do query federation between databricks sql and something else?

Databricks Community

Hot path event processing and serving in databricks

Photos

Join Us as a Local Community Builder!

Virtual Learning Festival: 9 April - 30 April

Intelligent Data Warehousing: AI/BI for Self-service Analytics

Get Started With Lakehouse Architecture | Pass a quiz to earn your certificate completion.

Data + AI Summit 2025 — registration now open!

Databricks Community Champion - March 2025 - Takuya Omi