cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

Srikanth_Gupta_
by Databricks Employee
  • 2306 Views
  • 2 replies
  • 1 kudos

How can I use data skipping with Delta Lake

How does data skipping work with delta lake, can I run ANALYZE TABLE COMPUTE STATISTICS with Delta lake? or Zorder going to solve these problems?

  • 2306 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
New Contributor III
  • 1 kudos

You can use Zorder with indexes for data skipping. Data skipping information is collected automatically when you write to delta table. Delta lake uses this information to provide faster query.You dont need to configure anything for data skipping as t...

  • 1 kudos
1 More Replies
shubhadip
by New Contributor
  • 1266 Views
  • 1 replies
  • 0 kudos

If we do z-order on a particular column will delta log stats collection be affected?

Let's assume a table contains more than 40 columns, now we know it automatically collects stat for the first 32 columns. If we run a z-order on a particular column(let's say column 1), then will the log file collect stats for all the 32 columns or wi...

  • 1266 Views
  • 1 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Shubhadip Ghosh​ : Hope this helps. In Delta Lake, when you perform Z-Ordering on a particular column, it reorganizes the data within the files based on the values of that column. However, Z-Ordering itself does not directly affect the statistics co...

  • 0 kudos
shubhadip
by New Contributor
  • 1279 Views
  • 1 replies
  • 0 kudos

Will consecutive delete insert affect z-ordering?

Let's say there is a delta table with a date field as its partition. In a table where condition, we delete all the rows according to the division. The data is currently being inserted into the same date field. If we do a z-order after inserting the d...

  • 1279 Views
  • 1 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Shubhadip Ghosh​ :In Delta Lake, when you perform a delete operation on a table, it doesn't physically remove the data from the files. Instead, it marks the affected rows for deletion by adding a tombstone marker to the Delta transaction log. This e...

  • 0 kudos
zeta_load
by New Contributor II
  • 2618 Views
  • 3 replies
  • 2 kudos

Resolved! Z-orderiing df using python

Is there a way to perform Z-ordering using python? With sql you you should be able to use:%sql OPTIMIZE df ZORDER BY (column)however I get the error "Table or view 'df' not found in database 'default''" and since I'm not really using sql, I would lik...

  • 2618 Views
  • 3 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Lukas Goldschmied​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

  • 2 kudos
2 More Replies
User16835756816
by Valued Contributor
  • 3730 Views
  • 3 replies
  • 1 kudos

How can I optimize my data pipeline?

Delta Lake provides optimizations that can help you accelerate your data lake operations. Here’s how you can improve query speed by optimizing the layout of data in storage.There are two ways you can optimize your data pipeline: 1) Notebook Optimizat...

  • 3730 Views
  • 3 replies
  • 1 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

some tips from me:Look for data skews; some partitions can be huge, some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions()), especially SQL can divide it unequally to parti...

  • 1 kudos
2 More Replies
Erik
by Valued Contributor III
  • 3899 Views
  • 4 replies
  • 2 kudos

Resolved! Does Z-ordering speed up reading of a single file?

Situation: we have one partion per date, and it just so happens that each partition ends up (after optimize) as *a single* 128mb file. We partition on date, and zorder on userid, and our query is something like "find max value of column A where useri...

  • 3899 Views
  • 4 replies
  • 2 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Z-Order will make sure that in case you need to read multiple files, these files are co-located.For a single file this does not matter as a single file is always local to itself.If you are certain that your spark program will only read a single file,...

  • 2 kudos
3 More Replies
User16790091296
by Contributor II
  • 4052 Views
  • 1 replies
  • 0 kudos
  • 4052 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance / avoid conflicts General rules of thumb for choosing the right partition columns   Cardinality of a colu...

  • 0 kudos
aladda
by Databricks Employee
  • 1608 Views
  • 1 replies
  • 0 kudos
  • 1608 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Z-ordering is generally effective on up to 3-4 columns and New clustering algorithm in DBR 7.6 can even go upto 5 columns. However, the key is to Z-order on columns that are typically used in filters/where predicates and joins.

  • 0 kudos
aladda
by Databricks Employee
  • 49661 Views
  • 2 replies
  • 1 kudos
  • 49661 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms to dramatically reduce the amount of data that needs to be read. Syntax fo...

  • 1 kudos
1 More Replies
Labels