cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Optimize table for joins using identity column

LiamS
New Contributor

Hi There,

 

I'm new to the delta table format so please bear with me if I've missed something obvious! I've migrated data from on prem. Sql to fabric and stored two related tables as delta tables. When I query data from these tables and join them based on a related key the query takes a significant amount of time. Ie 60 seconds for a limit 1000 sql query. Table 1 has c. 6m rows table 2 maybe 1m. The data types are currently string but I can change this to integer should it help. The keys are integers from sql but I've stored them as string format for now.

 

Is them being a string hindering performance or should I employ an optimisation technique such as Z ordering? (I have tried Z ordering but it has no impact on the files.)

 

I am using pyspark in a notebook in ms fabric which I understand runs delta 2.3. I believe later versions (those on databricks) also support an auto incrementing identity column which isn't in place here. 

1 ACCEPTED SOLUTION

Accepted Solutions

Sidhant07
Databricks Employee
Databricks Employee

Hi,

You mentioned that you have tried Z-ordering but it didn't impact the performance. Z-ordering is a technique that co-locates related information in the same set of files. It works best when the data is filtered by the column specified in the Z-ordering. If your queries are not filtering on the Z-ordered column, you might not see a significant performance improvement.
https://docs.databricks.com/en/delta/data-skipping.html

https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime...

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

 

View solution in original post

1 REPLY 1

Sidhant07
Databricks Employee
Databricks Employee

Hi,

You mentioned that you have tried Z-ordering but it didn't impact the performance. Z-ordering is a technique that co-locates related information in the same set of files. It works best when the data is filtered by the column specified in the Z-ordering. If your queries are not filtering on the Z-ordered column, you might not see a significant performance improvement.
https://docs.databricks.com/en/delta/data-skipping.html

https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime...

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group