Hi @EDDatabricks, let me try to answer all the questions:
- The _delta_log size is not a logical number; it is the actual size on storage. If _delta_log is large, that is because of the number of Delta versions created and retained for the table. By default, the Delta log retention (delta.logRetentionDuration) is 30 days. You can consider reducing this (see the first sketch after this list), but it will shorten the window for time travel to previous versions.
- The time taken by VACUUM depends on the total number of folders and files present in the table.
- VACUUM deletes only the stale files in the table, i.e. files that are no longer referenced by any retained version. If there is no significant reduction in size, it means that most of the files are still referenced by the table.
- In the absence of VACUUM, the table accumulates more files, so a SELECT query has to list and read through more files, which can hurt query performance. You should run OPTIMIZE and VACUUM on the table frequently.
- VACUUM should be run frequently, and before running VACUUM you should run OPTIMIZE. OPTIMIZE compacts the data and reduces the number of files in the table by combining small files into larger ones. As part of this process it leaves behind stale files, which VACUUM can then delete (see the second sketch below).
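Here is a minimal sketch of tuning the log retention mentioned in the first point, assuming a Databricks notebook where the `spark` session is available; the table name `events` is a hypothetical placeholder:

```python
# Shorten the Delta log retention so _delta_log shrinks over time.
# Trade-off: this reduces the time-travel window to 7 days.
# `events` is a placeholder table name.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
""")
```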
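And a sketch of the OPTIMIZE-then-VACUUM sequence from the last point, again with the hypothetical `events` table; VACUUM uses its default 7-day retention threshold here:

```python
# Compact small files into larger ones first...
spark.sql("OPTIMIZE events")
# ...then delete the stale files left behind (older than the retention threshold).
spark.sql("VACUUM events")
```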
In addition, here are some insights from the information you provided:
Table data is 7.5 TB spread across 500K files, which works out to an average file size of roughly 15 MB, which is very small. My understanding is that the partitioning strategy on your table is generating very small partitions, which is not efficient.
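If you want to verify that figure on your side, one way (assuming the `spark` session and the same hypothetical `events` table name) is to read the file statistics returned by DESCRIBE DETAIL:

```python
# numFiles and sizeInBytes are standard columns in DESCRIBE DETAIL output.
detail = spark.sql("DESCRIBE DETAIL events").first()
avg_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"{detail['numFiles']} files, average file size {avg_mb:.1f} MB")
# 7.5 TB across 500,000 files works out to roughly 15 MB per file.
```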
It is good practice to run OPTIMIZE and VACUUM regularly, but from the information provided it looks like you have not run VACUUM in a long time. Since there is a large backlog for OPTIMIZE and VACUUM to work through, the first runs will take longer. Once the backlog is cleared, the time taken will come down as well.