We have a setup where we process sensor data in Databricks using PySpark Structured Streaming from Kafka, and continuously write the results to Delta tables. These Delta tables are served to users through a SQL warehouse endpoint. We also store all the messages in long-term storage, so we can backfill and recompute if needed. The cool thing about this is that we use the exact same code for both the hot path and the cold/backfill path, and there is a single endpoint (the SQL warehouse) that users query to get both old and fresh data. The bad thing is that the hot path is... not so hot. It is clearly not optimal to continuously write tiny amounts of new data to the Delta tables: writing ends up taking around 5-10 seconds (even though the computation itself is sub-second). Not to mention the trillions of tiny files we end up with (though those can be handled with OPTIMIZE). So the end-to-end latency is around 10 seconds, and the majority of it comes from just writing to the Delta tables.
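For context, the hot path is essentially the following (heavily simplified; the broker, topic, schema, checkpoint path, and table name are placeholders, and the real job does more transformation work):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Placeholder schema for the sensor messages.
sensor_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

# Read from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sensor-readings")
    .load()
)

# Parse the Kafka payload; the same transformation code is reused for the backfill path.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), sensor_schema).alias("m"))
    .select("m.*")
)

# Each micro-batch commits a small Delta transaction; this commit overhead is
# where most of the 5-10 s of write latency comes from.
(
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/sensor_hot")
    .trigger(processingTime="5 seconds")
    .toTable("analytics.sensor_readings")
)
```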
What I want is a solution where I can keep doing the transformations (both hot and cold) in Databricks, but bring down the end-to-end latency (Kafka to end user), and preferably keep a single endpoint that users can query to get both old and new data. And for that I want your help!
So far I see some possible alternatives. All of them involve running a separate Structured Streaming job which pushes data somewhere:
1. Maybe it is possible to create some in-memory table in the SQL warehouse where I continuously push new data (while removing old), and then the end user can access a view which is the union of this table and the Delta-backed permanent store. I have investigated this a bit, and since global temp tables are disabled in SQL warehouses I am uncertain whether this can be done at all.
2a. I can push the data to a PostgreSQL database with TimescaleDB installed, and then give the SQL warehouse access to it via query federation. The end user can then access a view which is the union of the PostgreSQL table and the Delta-backed permanent storage (both the write side and the view are sketched below, after this list). The issue is that window functions are not pushed down to PostgreSQL, so I am quite uncertain what happens when the user asks for a resampling on the union view. Does it fetch everything from PostgreSQL? How will that work in practice? Or does it just crash? (I can't test yet since we don't have Unity Catalog, but that will come.)
2b. As 2a, but I put some server in front of PostgreSQL and the SQL warehouse which handles spreading the query across both. This might be what I end up with, but it is the least desirable option since it involves writing custom bespoke code for all the possible queries we might want to run across hot and lukewarm data. Maybe there exists some library I have not heard about which can do query federation between Databricks SQL and something else?
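To make 2a concrete, the write side I imagine is a separate streaming job that appends each micro-batch to a TimescaleDB table over JDBC. This is only a sketch with assumed names (connection URL, target table, secret scope/keys); removal of old rows would happen on the Postgres side, e.g. via a TimescaleDB retention policy:

```python
# 'parsed' is the transformed streaming DataFrame from the hot-path sketch above.
def push_to_timescale(batch_df, batch_id):
    # Append this micro-batch to a (hypothetical) TimescaleDB hypertable over JDBC.
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/sensors")            # placeholder
        .option("dbtable", "hot.sensor_readings")                           # placeholder
        .option("user", dbutils.secrets.get("my_scope", "pg_user"))         # placeholder scope/keys
        .option("password", dbutils.secrets.get("my_scope", "pg_password"))
        .mode("append")
        .save()
    )

(
    parsed.writeStream
    .foreachBatch(push_to_timescale)
    .option("checkpointLocation", "/checkpoints/sensor_hot_pg")             # placeholder
    .trigger(processingTime="1 second")
    .start()
)
```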
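The read side would then be a single view that unions the federated PostgreSQL table with the Delta table, split on a time boundary so each row is served from exactly one place. This is just how I picture it, with assumed catalog/schema/table names and an assumed 5-minute hand-over window; I haven't been able to test it without Unity Catalog:

```python
spark.sql("""
CREATE OR REPLACE VIEW analytics.sensor_readings_unified AS
SELECT sensor_id, ts, value
FROM analytics.sensor_readings             -- Delta-backed permanent store
WHERE ts <  current_timestamp() - INTERVAL 5 MINUTES
UNION ALL
SELECT sensor_id, ts, value
FROM pg_fed.hot.sensor_readings            -- federated TimescaleDB table
WHERE ts >= current_timestamp() - INTERVAL 5 MINUTES
""")
```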
Hi @Erik,
- Use Delta Lake to optimize tables by coalescing small files into larger ones
- Use the OPTIMIZE command on Delta Lake tables to improve write speed and reduce the number of small files (sketched below)
- Use ZORDER for multi-dimensional clustering to further enhance query performance
- Leverage Structured Streaming in Databricks to process and write data incrementally with low latency
- Use the writeStream method to write streaming query output to Delta Lake tables
- Table streaming reads and writes in Spark Structured Streaming can overcome limitations of streaming systems and files
- Table streaming can help coalesce small files produced by low-latency ingest
- Table streaming supports "exactly-once" processing
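For example, something along these lines (table and column names are placeholders, and treat this as a sketch rather than exact settings for your workload; your streaming writeStream job itself stays as you have it):

```python
# Scheduled or one-off compaction with multi-dimensional clustering.
spark.sql("OPTIMIZE analytics.sensor_readings ZORDER BY (sensor_id, ts)")

# Optionally let Databricks compact small files as they are written.
spark.sql("""
ALTER TABLE analytics.sensor_readings SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
)
""")
```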