Data Engineering

Forum Posts

by Srikanth_Gupta_ • Databricks Employee

06-09-2021 9:12:01 PM

4375 Views
2 replies
1 kudos

How can I use data skipping with Delta Lake

How does data skipping work with Delta Lake? Can I run ANALYZE TABLE COMPUTE STATISTICS on a Delta Lake table, or is Z-ordering going to solve these problems?

Latest Reply

Anonymous
New Contributor III

06-27-2023 7:52:31 PM

1 kudos

You can use Z-order with indexes for data skipping. Data skipping information is collected automatically when you write to a Delta table. Delta Lake uses this information to speed up queries. You don't need to configure anything for data skipping as t...
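
To make the idea in this reply concrete, here is a minimal pure-Python sketch of file-level data skipping. It assumes each data file carries per-column min/max statistics, which is the kind of information Delta Lake records at write time. The FileStats class, the file names, and files_to_scan are hypothetical illustrations, not Delta Lake APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (not a Delta Lake class)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, value):
    """Return the paths of files whose [min, max] range could contain `value`.

    Files whose range excludes the value are skipped without being read.
    """
    return [f.path for f in files if f.min_value <= value <= f.max_value]


# Illustrative data: three files with made-up user_id ranges.
files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 150, 300),
]

# A filter like `user_id = 180` only needs the files whose ranges include 180.
candidates = files_to_scan(files, 180)
print(candidates)
```

The last two files have overlapping ranges, so both must be read for a value such as 180. Tighter, less overlapping ranges per file let more files be skipped, which is the effect Z-ordering aims for.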

1 More Replies

by shubhadip • New Contributor

05-31-2023 8:42:56 AM

1916 Views
1 reply
0 kudos

If we do z-order on a particular column will delta log stats collection be affected?

Let's assume a table contains more than 40 columns. We know that Delta Lake automatically collects stats for the first 32 columns. If we run a Z-order on a particular column (let's say column 1), will the log file then collect stats for all 32 columns or wi...
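
A small sketch of the rule the question describes, assuming statistics are collected for the first N top-level columns in schema order and that N comes from the delta.dataSkippingNumIndexedCols table property (default 32, with -1 meaning all columns). The helper columns_with_stats and the column names are hypothetical; nested fields, which real tables count individually, are ignored here.

```python
DEFAULT_NUM_INDEXED_COLS = 32  # assumed default of delta.dataSkippingNumIndexedCols


def columns_with_stats(schema_columns, num_indexed_cols=DEFAULT_NUM_INDEXED_COLS):
    """Return the columns that would get statistics under the positional rule."""
    if num_indexed_cols < 0:
        # A negative setting is treated as "collect stats for every column".
        return list(schema_columns)
    return list(schema_columns[:num_indexed_cols])


# A made-up 40-column schema: col_1 ... col_40.
schema = [f"col_{i}" for i in range(1, 41)]
indexed = columns_with_stats(schema)
print(len(indexed), "col_1" in indexed, "col_40" in indexed)
```

Under this assumption, the set of indexed columns depends on column position rather than on the Z-order key, so Z-ordering by col_1 would leave it unchanged.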

Latest Reply

Anonymous
Not applicable

06-20-2023 5:19:45 AM

0 kudos

@Shubhadip Ghosh: Hope this helps. In Delta Lake, when you perform Z-Ordering on a particular column, it reorganizes the data within the files based on the values of that column. However, Z-Ordering itself does not directly affect the statistics co...

by shubhadip • New Contributor

05-31-2023 8:58:56 AM

1807 Views
1 reply
0 kudos

Will consecutive delete insert affect z-ordering?

Let's say there is a Delta table partitioned by a date field. Using a WHERE condition, we delete all the rows in a given partition, and new data is then inserted into that same date partition. If we do a z-order after inserting the d...

Latest Reply

Anonymous
Not applicable

06-20-2023 5:18:52 AM

0 kudos

@Shubhadip Ghosh: In Delta Lake, when you perform a delete operation on a table, it doesn't physically remove the data from the files. Instead, it marks the affected rows for deletion by adding a tombstone marker to the Delta transaction log. This e...

by zeta_load • Databricks Partner

05-16-2023 12:52:43 AM

5280 Views
3 replies
2 kudos

Resolved! Z-ordering df using Python

Is there a way to perform Z-ordering using Python? With SQL you should be able to use: %sql OPTIMIZE df ZORDER BY (column) However, I get the error "Table or view 'df' not found in database 'default'" and since I'm not really using SQL, I would lik...
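
One way to approach this, sketched under assumptions: OPTIMIZE operates on a Delta table, not on a DataFrame variable, so the DataFrame has to be saved as a Delta table first and the statement pointed at that table's name or path. The helper below only builds the SQL and submits it through a session object passed in by the caller; zorder_table, RecordingSession, the path, and the column name are hypothetical. Recent Delta Lake releases also expose a Python API for this (DeltaTable.forPath(spark, path).optimize().executeZOrderBy(...)); check that your runtime version supports it.

```python
def zorder_table(spark, table_path, columns):
    """Run OPTIMIZE ... ZORDER BY on a path-based Delta table.

    `spark` is any object with a `sql(str)` method, normally a SparkSession.
    Returns the statement that was submitted.
    """
    if not columns:
        raise ValueError("at least one Z-order column is required")
    statement = f"OPTIMIZE delta.`{table_path}` ZORDER BY ({', '.join(columns)})"
    spark.sql(statement)
    return statement


class RecordingSession:
    """Stand-in for a SparkSession so the sketch runs without a cluster."""

    def __init__(self):
        self.statements = []

    def sql(self, statement):
        self.statements.append(statement)


session = RecordingSession()
# Hypothetical location where the DataFrame was saved beforehand, e.g. with
# df.write.format("delta").save("/tmp/events").
print(zorder_table(session, "/tmp/events", ["user_id"]))
```

In a notebook, the real spark session would be passed in place of RecordingSession.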

Latest Reply

Anonymous
Not applicable

05-23-2023 1:51:57 AM

2 kudos

Hi @Lukas Goldschmied, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

2 More Replies

by User16835756816 • Databricks Employee

01-23-2023 3:55:06 PM

5898 Views
3 replies
1 kudos

How can I optimize my data pipeline?

Delta Lake provides optimizations that can help you accelerate your data lake operations. Here’s how you can improve query speed by optimizing the layout of data in storage. There are two ways you can optimize your data pipeline: 1) Notebook Optimizat...

Latest Reply

Hubert-Dudek
Databricks MVP

01-24-2023 10:40:50 AM

1 kudos

Some tips from me: Look for data skew; some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI to check that, but also debug your code a bit (call getNumPartitions()); SQL especially can divide data unequally across parti...

2 More Replies

by Erik • Valued Contributor III

11-05-2021 11:45:45 AM

5347 Views
4 replies
2 kudos

Resolved! Does Z-ordering speed up reading of a single file?

Situation: we have one partition per date, and it just so happens that each partition ends up (after optimize) as *a single* 128 MB file. We partition on date and Z-order on userid, and our query is something like "find max value of column A where useri...

Latest Reply

-werners-
Esteemed Contributor III

11-07-2021 10:52:51 PM

2 kudos

Z-Order will make sure that in case you need to read multiple files, these files are co-located. For a single file this does not matter, as a single file is always local to itself. If you are certain that your Spark program will only read a single file,...

3 More Replies

by User16790091296 • Databricks Employee

05-28-2021 12:22:29 PM

6106 Views
1 reply
0 kudos

What's the difference between Z-Ordering and Partitioning?

Latest Reply

sajith_appukutt
Databricks Employee

06-24-2021 3:02:47 PM

0 kudos

Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance / avoid conflicts. General rules of thumb for choosing the right partition columns: Cardinality of a colu...

by aladda • Databricks Employee

06-23-2021 9:11:10 PM

2464 Views
1 reply
0 kudos

Resolved! What's the recommended number of columns to Z-order a Delta table by?

Latest Reply

aladda
Databricks Employee

06-23-2021 9:12:46 PM

0 kudos

Z-ordering is generally effective on up to 3-4 columns, and the new clustering algorithm in DBR 7.6 can go up to 5 columns. However, the key is to Z-order on columns that are typically used in filters/WHERE predicates and joins.

by aladda • Databricks Employee

05-28-2021 12:23:24 PM

79915 Views
2 replies
1 kudos

Resolved! What is Z-ordering in Delta and what are some best practices on using it?

Latest Reply

aladda
Databricks Employee

06-19-2021 8:25:11 PM

1 kudos

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by the data-skipping algorithms of Delta Lake on Databricks to dramatically reduce the amount of data that needs to be read. Syntax fo...
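
To illustrate how a Z-order curve places related values near each other, here is a minimal sketch that interleaves the bits of two integer keys into a single sort key (a Morton code). It shows the general technique only and is not Delta Lake's implementation; z_value and the sample points are hypothetical.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x fills the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y fills the odd bit positions
    return code


# A 4x4 grid of (x, y) keys, sorted along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:4])
```

The first four points in this order form the 2x2 block where both x and y are below 2, so a file holding them would have narrow min/max ranges on both columns at once.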

1 More Replies
