cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Optimize table for joins using identity column

LiamS
New Contributor

Hi There,

 

I'm new to the delta table format so please bear with me if I've missed something obvious! I've migrated data from on prem. Sql to fabric and stored two related tables as delta tables. When I query data from these tables and join them based on a related key the query takes a significant amount of time. Ie 60 seconds for a limit 1000 sql query. Table 1 has c. 6m rows table 2 maybe 1m. The data types are currently string but I can change this to integer should it help. The keys are integers from sql but I've stored them as string format for now.

 

Is them being a string hindering performance or should I employ an optimisation technique such as Z ordering? (I have tried Z ordering but it has no impact on the files.)

 

I am using pyspark in a notebook in ms fabric which I understand runs delta 2.3. I believe later versions (those on databricks) also support an auto incrementing identity column which isn't in place here. 

1 ACCEPTED SOLUTION

Accepted Solutions

Sidhant07
New Contributor III
New Contributor III

Hi,

You mentioned that you have tried Z-ordering but it didn't impact the performance. Z-ordering is a technique that co-locates related information in the same set of files. It works best when the data is filtered by the column specified in the Z-ordering. If your queries are not filtering on the Z-ordered column, you might not see a significant performance improvement.
https://docs.databricks.com/en/delta/data-skipping.html

https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime...

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

 

View solution in original post

2 REPLIES 2

Kaniz
Community Manager
Community Manager

Hi @LiamSThe performance of your query can be affected by several factors, including the data types of the columns you're joining. Using string data type for the join keys might be hindering the performance. Converting the keys to the integer could improve performance as integer comparisons are generally faster than string comparisons.

Delta Lake provides several optimizations that can help improve the performance of your queries, including:-

- Data integrity constraints: These enforce data consistency and prevent accidental writes with incorrect or invalid schemas.
- Schema evolution: This allows for controlled schema evolution when business needs change.
- Transactional Guarantees: Writes to delta tables are ACID compliant, ensuring data consistency.
- Upserts: This feature updates data if it exists. Otherwise, it inserts new data. 

However, as you mentioned, Z-ordering did not improve the performance in your case.

This might be because Z-ordering is most effective when the data is filtered on the Z-ordered columns, which might not be the case in your scenario. In conclusion, you might want to consider changing the data type of your join keys from string to integer to improve the performance of your queries.

Sidhant07
New Contributor III
New Contributor III

Hi,

You mentioned that you have tried Z-ordering but it didn't impact the performance. Z-ordering is a technique that co-locates related information in the same set of files. It works best when the data is filtered by the column specified in the Z-ordering. If your queries are not filtering on the Z-ordered column, you might not see a significant performance improvement.
https://docs.databricks.com/en/delta/data-skipping.html

https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime...

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.