Re: Incremental Loads from a Catalog/DLT

lprevost · ‎06-26-2024

Thank you for your reply. I have other reasons for not wanting to do this with Autoloader: some upstream processes are already appending partitions to the catalog table.

Is there a way to do an incremental load from a catalog table where the checkpoint tracks which partitions have already been processed?
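To make the question concrete, here is a minimal plain-Python sketch of the behavior I'm after. It is not a Databricks or DLT API: the function names and the JSON checkpoint file are hypothetical, and the partition values are made up for illustration.

```python
import json
from pathlib import Path


def load_processed(checkpoint_path):
    """Return the set of partition values already loaded (empty on first run)."""
    path = Path(checkpoint_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def new_partitions(all_partitions, processed):
    """Partitions present in the table but not yet recorded in the checkpoint."""
    return sorted(set(all_partitions) - processed)


def commit(checkpoint_path, processed, just_loaded):
    """Record newly loaded partitions only after the load has succeeded."""
    updated = processed | set(just_loaded)
    Path(checkpoint_path).write_text(json.dumps(sorted(updated)))
    return updated
```

On each run, the job would list the table's partitions, load only what `new_partitions` returns, and then call `commit`, so a failed run picks up the same partitions again next time.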

I'm also having some problems with Autoloader as follows:

- I have a large number of gzip'd CSV files. Most are manageable in size, but a few are very large, and Spark hangs on a few of them at the end of a group of tasks, presumably because of skew toward the larger files. Some files take an hour, with only one task running and all other nodes idle. I'm not sure that reading from the catalog instead of using Autoloader would solve this, and I'm scratching my head as to how to fix it.

I've read this is because gzip is unsplittable, so a single file ties up one node until it has been read completely. The autoscaler has also dropped nodes that were almost finished reading, so those reads start over.
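As I understand it, the unsplittable part can be seen with nothing but the Python standard library. This is a small sketch, not Spark code: the sample rows are made up and `can_decompress_from` is a hypothetical helper name.

```python
import gzip
import zlib


def can_decompress_from(blob, offset):
    """Try to decompress a gzip blob starting at the given byte offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this offset, or the deflate stream is unreadable.
        return False


# Made-up CSV-like rows, compressed as a single gzip member.
rows = "".join(f"{i},value_{i}\n" for i in range(10_000)).encode()
blob = gzip.compress(rows)

readable_from_start = can_decompress_from(blob, 0)
readable_from_middle = can_decompress_from(blob, len(blob) // 2)
```

Reading from offset 0 works, but starting in the middle of the stream fails, which is why a reader cannot hand the second half of a gzip'd file to another task.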

I have about 650 files ranging in size from 30MB to 1.2GB.
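Here is a rough model of why I think a few big files dominate the run time. It assumes one task per file and a constant read rate per task slot. Both are simplifications, the file sizes in the usage note are invented, and `stage_makespan` is a hypothetical name.

```python
import heapq


def stage_makespan(file_sizes_mb, slots, mb_per_sec):
    """Seconds until the last task finishes, with one unsplittable task per file.

    Files are taken in order; each goes to whichever slot frees up first.
    """
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for size in file_sizes_mb:
        start = heapq.heappop(finish_times)  # earliest available slot
        heapq.heappush(finish_times, start + size / mb_per_sec)
    return max(finish_times)
```

For example, with eight 30 MB files followed by one 1200 MB file on 4 slots at 1 MB/s, `stage_makespan([30] * 8 + [1200], 4, 1.0)` gives 1260 seconds, while spreading the same 1440 MB evenly over the 4 slots would take 360 seconds. For most of the run, three slots sit idle while one reads the large file.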