Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks warehouse table optimization

5UDO
New Contributor II

Hi everyone,

I just started using Databricks and wanted to evaluate read speeds when using a Databricks SQL warehouse.
I generated a dataset of 100M records containing name, surname, date of birth, phone number, and address, and saved it to a Delta table.

Query response time needs to stay below 200 ms when querying for a specific person. The name, surname, and date of birth columns are used to identify the person.

I tried partitioning by year of date of birth and by first letter of surname, then Z-ordering, Bloom filter indexing, hashing, and combinations of these, but I never reached the needed speeds.
The closest I got was by hashing the name, surname, and date of birth columns with pyspark.sql.functions.hash() and then Z-ordering on the hash values. With that I managed an average of around 270 ms over a sample of 200 records.
(Records are randomly sampled from a different Delta table that has the same data.)
This was measured on the smallest warehouse.
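For reference, the hash-then-Z-order layout described above can be sketched in Databricks SQL roughly like this (table and column names here are illustrative, not the poster's actual schema):

```sql
-- Add a single lookup key by hashing the identifying columns.
-- hash() is the Spark SQL built-in, equivalent to pyspark.sql.functions.hash.
CREATE OR REPLACE TABLE people_hashed AS
SELECT hash(name, surname, date_of_birth) AS person_hash, *
FROM people;

-- Co-locate rows by the hash so a point lookup touches few files.
OPTIMIZE people_hashed ZORDER BY (person_hash);

-- Point lookup: filter on the hash first, then re-check the real
-- columns, since hash() can collide.
SELECT *
FROM people_hashed
WHERE person_hash = hash('Jane', 'Doe', DATE'1990-01-01')
  AND name = 'Jane' AND surname = 'Doe'
  AND date_of_birth = DATE'1990-01-01';
```

The trailing equality predicates matter: a 32-bit hash over 100M rows will produce some collisions, so the hash should only be used to prune files, not as the sole match condition.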

Does anyone know what other methods are available to improve query time?
Where can I find concrete material on how to optimize querying, and how can I recognize which optimization method is better in which case?
If you have any questions or need more information, I will gladly answer.


Brahmareddy
Honored Contributor III

Hi 5UDO,

How are you doing today? As I understand it, you're already doing a great job exploring advanced optimization techniques like partitioning, Z-ordering, and even hashing, which shows you're thinking in the right direction. Getting around 270 ms against a 200 ms goal on the smallest warehouse is actually impressive given the size of your dataset.

One suggestion would be to temporarily try a larger warehouse to see how much that impacts latency, since Databricks SQL warehouses are elastic and easy to resize. Also, make sure column statistics are up to date by running ANALYZE TABLE COMPUTE STATISTICS FOR ALL COLUMNS, which helps the optimizer plan queries. Another angle is to materialize a lightweight index table with pre-hashed or encoded keys just for lookups, or even to cache hot subsets if the same queries run often.

For more structured learning, the official Databricks docs have a section on Delta Lake performance tuning, and Databricks Academy offers free and paid courses that cover best practices. Let me know if you want help designing a query strategy based on your access pattern!
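The statistics suggestion above looks like this in Databricks SQL (the table name is illustrative):

```sql
-- Recompute column-level statistics so the query planner can make
-- better file-skipping and join decisions.
ANALYZE TABLE people COMPUTE STATISTICS FOR ALL COLUMNS;
```

Note that if the table hasn't changed since the statistics were last gathered, re-running this is unlikely to change query plans.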

Regards,

Brahma

Thank you so much for the info you shared.


AndrewN
Databricks Employee

Have you tried using Liquid Clustering for your table instead of partitioning/Z-ordering? https://docs.databricks.com/aws/en/delta/clustering
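Switching a Delta table to Liquid Clustering is a small change; a minimal sketch, assuming a hypothetical `people` table with the poster's identifying columns, might look like:

```sql
-- Replace Hive-style partitioning / Z-ordering with Liquid Clustering.
-- Cluster keys should match the lookup predicates; these are illustrative.
ALTER TABLE people
CLUSTER BY (surname, name, date_of_birth);

-- Cluster the data; later OPTIMIZE runs are incremental, so layout
-- maintenance is cheaper than repeatedly re-running ZORDER.
OPTIMIZE people;
```

Unlike Z-ordering, the cluster keys are part of the table definition, so writes and background optimization keep the layout current without a manual ZORDER job.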

5UDO
New Contributor II

Hi Brahmareddy and AndrewN,
Thank you for your answers.

First, I need to apologize: I accidentally wrote that I got 270 ms by hashing the date of birth, surname, and name and then Z-ordering.
I actually achieved around 290 ms by hashing the date of birth, surname, and name and then Bloom filter indexing the hash value column.

@Brahmareddy 
For now I tried a 2X-Large warehouse, but I only saw better performance on unoptimized tables.
Tables that were already running well improved only marginally, maybe 10-20 ms.
I even ran the same queries twice to see how fast the local cache is, but even that wasn't always faster than 200 ms.
And even if execution time were always under 200 ms, I would need a different approach, since any person is equally likely to be queried at any time.

I ran ANALYZE TABLE COMPUTE STATISTICS FOR ALL COLUMNS, but there was no improvement.
I assume that's because there were no changes to the table since it was created.

materialize a lightweight index table with pre-hashed or encoded keys just for lookup purposes

This is still something I need to test, but I'm also not sure of the best way to do it in Databricks.

@AndrewN 
I haven't tried Liquid Clustering yet, but from what I've read, in some cases Z-ordering is faster for reading data, while Liquid Clustering is faster for writing data.
Are there cases where you got better read speeds with LC than with Z-ordering?

Brahmareddy
Honored Contributor III

Hi 5UDO, no worries at all, and thanks for the clarification; it actually makes your results even more insightful! You're right that Bloom filters on a hashed column can help narrow down files, but they don't guarantee fast reads if the underlying data layout isn't tight. As you observed, scaling up the warehouse often improves performance on less-optimized tables but has diminishing returns once your data is already well structured. You're also right that Z-ordering is generally better for read-heavy workloads, especially for highly selective queries like yours, while Liquid Clustering shines more for write-heavy or schema-evolving tables. That said, Liquid Clustering can outperform Z-ordering where data changes frequently, or where incremental compaction and cluster-aware sorting give better long-term layout maintenance, so it's worth trying if you want a more hands-off clustering method.

As for your lookup use case, building a materialized index table could help: you'd create a compact Delta table with just the hashed_id, maybe the original id, and the dimensions needed for joins, and Z-order that tiny table for quick binary-search-style lookups. It's a pattern similar to building a pre-joined lookup table. Since any person could be queried at random, making the lookup predictable and cacheable at the storage layer becomes key.
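A minimal sketch of that index-table pattern, assuming a hypothetical `people` table with an `id` column (names and schema are illustrative):

```sql
-- Compact lookup table: only the hash key and the row identifier,
-- so it is small enough that a point lookup reads very little data.
CREATE OR REPLACE TABLE people_lookup AS
SELECT hash(name, surname, date_of_birth) AS person_hash, id
FROM people;

-- Co-locate rows by the hash key for fast, selective lookups.
OPTIMIZE people_lookup ZORDER BY (person_hash);

-- Two-step lookup: resolve the id cheaply, then fetch the full row.
SELECT p.*
FROM people_lookup l
JOIN people p ON p.id = l.id
WHERE l.person_hash = hash('Jane', 'Doe', DATE'1990-01-01');
```

Because the lookup table carries only two narrow columns, it compresses well and tends to stay in the warehouse's local cache, which is what makes the lookup predictable even when any person may be queried.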

You've already done great testing; at this point, it's about fine-tuning your access pattern. Let me know if you'd like help designing that index table or trying a hybrid LC + Z-order setup to compare results!

Regards,

Brahma
