How do I choose which column to partition by?

User16826992666
Databricks Employee

I am building my data pipeline, but I am unsure which fields in my data I should use for partitioning. What should I consider when choosing a partitioning strategy?

brickster_2018
Databricks Employee

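The important factors when choosing partition columns are:

  • Even distribution of data: pick a column whose values split the data into partitions of roughly similar size.
  • Query patterns: pick a column that is commonly filtered on, so that queries can skip partitions they do not need.
  • Avoid multiple levels of partitioning, as you can end up with a large number of small files.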
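
One rough way to check the first and third points before committing to a column is to count how many rows would land in each partition. The sketch below is plain Python over a made-up sample, not a Databricks API: the helper `partition_stats` and the columns `country`, `user_id`, and `status` are hypothetical names used only for illustration.

```python
from collections import Counter


def partition_stats(rows, column):
    """Summarize how rows would spread if the data were partitioned by `column`."""
    counts = Counter(row[column] for row in rows)
    sizes = list(counts.values())
    mean_size = sum(sizes) / len(sizes)
    return {
        "partitions": len(sizes),         # number of distinct values = number of partitions
        "largest": max(sizes),            # rows in the biggest partition
        "smallest": min(sizes),           # rows in the smallest partition
        "skew": max(sizes) / mean_size,   # 1.0 means perfectly even
    }


# Hypothetical sample of 1000 rows (assumption: stands in for a sample of your table).
countries = ["US", "DE", "IN", "BR"]
rows = [
    {
        "country": countries[i % 4],                  # 4 values, evenly spread
        "user_id": i,                                 # unique per row
        "status": "error" if i % 100 == 0 else "ok",  # heavily skewed
    }
    for i in range(1000)
]

for col in ("country", "user_id", "status"):
    print(col, partition_stats(rows, col))
```

In this sample, `country` gives a few evenly sized partitions, `user_id` gives one partition per row (the small-files problem), and `status` puts almost everything into a single partition. Whether a column is also commonly filtered on still has to come from looking at your actual queries.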