Partitioning distributes data by key so that each query scans less data, which improves performance and helps avoid write conflicts.
General rules of thumb for choosing the right partition columns:
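- Cardinality of the column should not be very high. A column such as a user ID, with a very large number of distinct values, produces too many small partitions.
- The amount of data in each partition should meet a minimum threshold. Databricks' guidance is to partition by a column only if each partition is expected to hold at least 1 GB.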
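The two rules above can be checked on a sample of the data before committing to a partition column. The following is a minimal sketch in plain Python, not a Delta or Spark API: the helper names, the thresholds, and the use of row counts as a stand-in for partition size are all assumptions made for illustration.

```python
from collections import Counter

def partition_stats(rows, column):
    """Return (number of distinct values, row count of the smallest partition)."""
    counts = Counter(row[column] for row in rows)
    return len(counts), min(counts.values())

def is_reasonable_partition_column(rows, column, max_cardinality, min_rows_per_partition):
    """Hypothetical check: low cardinality AND every partition is large enough."""
    cardinality, smallest = partition_stats(rows, column)
    return cardinality <= max_cardinality and smallest >= min_rows_per_partition

# Toy sample: 300 events spread over 3 dates, each with a unique user_id.
rows = [{"event_date": f"2024-01-0{1 + i % 3}", "user_id": i} for i in range(300)]

for column in ("event_date", "user_id"):
    ok = is_reasonable_partition_column(rows, column, max_cardinality=10, min_rows_per_partition=50)
    print(column, partition_stats(rows, column), "ok" if ok else "poor choice")
```

In this sample, `event_date` has 3 values with 100 rows each and passes, while `user_id` has 300 values with 1 row each and fails both rules. Real thresholds would be expressed in bytes per partition rather than row counts.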
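Delta also supports a feature called data skipping to speed up queries: it uses per-file statistics, such as the minimum and maximum value of each column, to skip files that cannot match a query's filters.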
Z-ordering is a multi-dimensional clustering technique that colocates related information in the same set of files, so that Databricks' data-skipping algorithms can dramatically reduce the amount of data that needs to be read. In terms of query read performance, it works somewhat like a secondary index.
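To see why clustering helps data skipping, here is a toy model in plain Python. It is a sketch under stated assumptions, not Delta's implementation: "files" are just lists of `(x, y)` points with min/max statistics, the Z-order value is computed by simple bit interleaving (a Morton code), and all function names are hypothetical.

```python
def interleave_bits(x, y, bits=3):
    """Compute a Z-order (Morton) value by interleaving the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def write_files(points, rows_per_file):
    """Split points into 'files' and record per-file min/max statistics."""
    files = []
    for start in range(0, len(points), rows_per_file):
        chunk = points[start:start + rows_per_file]
        files.append({
            "rows": chunk,
            "min_x": min(p[0] for p in chunk), "max_x": max(p[0] for p in chunk),
            "min_y": min(p[1] for p in chunk), "max_y": max(p[1] for p in chunk),
        })
    return files

def files_to_read(files, column, value):
    """Data skipping: keep only files whose [min, max] range could contain value."""
    return [f for f in files if f[f"min_{column}"] <= value <= f[f"max_{column}"]]

# An 8x8 grid of points written as 4 files of 16 rows each, in two layouts.
points = [(x, y) for x in range(8) for y in range(8)]
linear = write_files(sorted(points), rows_per_file=16)
zordered = write_files(sorted(points, key=lambda p: interleave_bits(*p)), rows_per_file=16)

for name, layout in [("sorted by x", linear), ("z-ordered", zordered)]:
    print(name,
          "| x == 5:", len(files_to_read(layout, "x", 5)), "files",
          "| y == 5:", len(files_to_read(layout, "y", 5)), "files")
```

With the data sorted by `x` only, a filter on `x` reads 1 of 4 files but a filter on `y` reads all 4, because every file spans the full range of `y`. With the Z-ordered layout, a filter on either column reads 2 of 4 files. On Databricks, this kind of layout is produced with `OPTIMIZE ... ZORDER BY (...)` rather than by hand.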