Optimize table for joins using identity column

LiamS — Sun, 08 Oct 2023 18:43:10 GMT

Hi there,

I'm new to the Delta table format, so please bear with me if I've missed something obvious! I've migrated data from on-prem SQL to Fabric and stored two related tables as Delta tables. When I query these tables and join them on a related key, the query takes a significant amount of time, i.e. 60 seconds for a LIMIT 1000 SQL query. Table 1 has c. 6m rows, table 2 maybe 1m. The keys are integers in SQL, but I've stored them as strings for now; I can change them to integers should it help.

Are the string keys hindering performance, or should I employ an optimisation technique such as Z-ordering? (I have tried Z-ordering, but it has no impact on the files.)

I am using PySpark in a notebook in MS Fabric, which I understand runs Delta 2.3. I believe later versions (those on Databricks) also support an auto-incrementing identity column, which isn't available here.

Re: Optimize table for joins using identity column

Sidhant07 — Thu, 09 Nov 2023 10:51:35 GMT

Hi,

You mentioned that you tried Z-ordering but saw no impact. Z-ordering is a technique that co-locates related information in the same set of files. It works best when queries filter on the Z-ordered column, because the engine can then skip files whose min/max statistics show they cannot contain matching rows. If your queries do not filter on the Z-ordered column, you might not see a significant performance improvement.
https://docs.databricks.com/en/delta/data-skipping.html
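
To make the co-location point concrete, here is a toy sketch in plain Python (no Spark involved). It assumes each "file" only records the min and max of its key column, and it uses a simple sort on one column as a stand-in for Z-ordering, which is a simplification. The function names and the numbers are made up for illustration.

```python
import random

def split_into_files(keys, rows_per_file):
    """Chunk keys into 'files' and keep only each file's (min, max) stats."""
    files = [keys[i:i + rows_per_file] for i in range(0, len(keys), rows_per_file)]
    return [(min(f), max(f)) for f in files]

def files_to_read(stats, wanted):
    """Count files whose min/max range could contain the wanted key."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)

random.seed(0)
shuffled = list(range(100_000))
random.shuffle(shuffled)

# Keys scattered across files: every file spans almost the whole key range.
unclustered = split_into_files(shuffled, 10_000)
# Keys co-located (sorted here as a stand-in for Z-ordering).
clustered = split_into_files(sorted(shuffled), 10_000)

scanned_unclustered = files_to_read(unclustered, 42_000)
scanned_clustered = files_to_read(clustered, 42_000)
print(scanned_unclustered, scanned_clustered)
```

With clustered keys, a filter on a single key value only has to open the one file whose range covers it. If the query never filters on that column, there is nothing to skip, which is consistent with seeing no improvement.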

https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html
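
If you want to try both ideas from your question, a minimal PySpark sketch could look like the following. It assumes a notebook where a `spark` session already exists and a Delta runtime that supports `OPTIMIZE ... ZORDER BY`. The table and column names (`table1`, `table1_int`, `customer_id`) are hypothetical, and `bigint` is just a guess at a suitable integer type. Whether the cast actually speeds up your join is something to measure rather than assume.

```python
def zorder_statement(table_name, columns):
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not columns:
        raise ValueError("at least one Z-order column is required")
    return f"OPTIMIZE {table_name} ZORDER BY ({', '.join(columns)})"

def cast_key(df, key_column, target_type="bigint"):
    """Return a DataFrame with the string key cast to an integer type."""
    return df.withColumn(key_column, df[key_column].cast(target_type))

def copy_with_int_key(spark, source_table, target_table, key_column):
    """Write a copy of the table with an integer key, then Z-order the copy."""
    df = cast_key(spark.read.table(source_table), key_column)
    # Writing to a separate table leaves the original untouched for comparison.
    df.write.format("delta").mode("overwrite").saveAsTable(target_table)
    spark.sql(zorder_statement(target_table, [key_column]))

# Hypothetical usage inside a notebook where `spark` is defined:
# copy_with_int_key(spark, "table1", "table1_int", "customer_id")
```

Keeping the original table in place lets you run the same join against both versions and compare timings.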
