Hi Team,
We want to create a Delta table with a historical load of about 10 TB of data, and we expect an incremental refresh of roughly 15 GB each day.
What factors should we take into account for managing such a large volume of data, especially with respect to cost and performance?
I have the following points in mind; please let me know if there's anything else we should consider.
1) Have a solid partitioning strategy in place, or consider Liquid Clustering if we are unsure how large each partition would be (see the first sketch after this list).
2) Run regular housekeeping such as OPTIMIZE/Z-ORDER and VACUUM on a weekly schedule, and adjust the frequency (for example, every other day) depending on how query performance holds up (second sketch below).
3) Move historical data older than 7 years into a separate archive table and run day-to-day transactions against the active table. Whether to do this depends on business needs and on whether all of the history is required for calculations (third sketch below).
4) When defining the table schema, place the most frequently filtered columns among the first 32 columns, since Delta collects data-skipping statistics only on the leading columns by default (fourth sketch below).
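
For item 1, a minimal sketch assuming a Databricks/Delta Lake runtime that supports Liquid Clustering; the table and column names (sales_delta, event_date, region, etc.) are placeholders, not our actual schema:

```python
# Sketch only: create the Delta table with Liquid Clustering (CLUSTER BY)
# instead of fixed hive-style partitions. Table/column names are assumptions.
# `spark` is the active SparkSession in a Databricks notebook/job.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_delta (
        event_date  DATE,
        region      STRING,
        customer_id BIGINT,
        amount      DECIMAL(18, 2)
    )
    USING DELTA
    CLUSTER BY (event_date, region)
""")
```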
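For item 2, a rough weekly maintenance pass against the same placeholder table. Note that Z-ORDER applies to partitioned (non-liquid-clustered) tables; a liquid-clustered table would run plain OPTIMIZE instead:

```python
# Sketch only: periodic maintenance. Table name, Z-ORDER column and the
# retention window are assumptions, not recommendations.
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")  # compact small files and co-locate related data
spark.sql("VACUUM sales_delta RETAIN 168 HOURS")           # drop unreferenced files older than 7 days
```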
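For item 3, one possible archival pass, again with placeholder names and assuming the archive table already exists with the same schema; the 7-year cutoff is expressed as 84 months:

```python
# Sketch only: copy rows older than ~7 years into an archive table, then
# remove them from the active table. Names and the cutoff are assumptions.
spark.sql("""
    INSERT INTO sales_delta_archive
    SELECT * FROM sales_delta
    WHERE event_date < add_months(current_date(), -84)
""")
spark.sql("""
    DELETE FROM sales_delta
    WHERE event_date < add_months(current_date(), -84)
""")
```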
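For item 4, the reason the first 32 columns matter is the delta.dataSkippingNumIndexedCols table property (default 32), which controls how many leading columns get min/max statistics for data skipping. A sketch, with the value shown only as an illustration:

```python
# Sketch only: statistics are collected on the first N columns of the schema,
# so frequently filtered columns should come first. The value below is
# illustrative; the default is already 32.
spark.sql("""
    ALTER TABLE sales_delta
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '32')
""")
```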
Could you please suggest any other performance factors we need to consider for such a big table?
Regards,
Janga