Kash
Contributor III

Hi there,

Thanks for getting back to me.

My question is about backfilling and loading historical data incrementally (day by day) using Auto Loader.

I would like to run Auto Loader on data that is partitioned by year/month/day, and have it read and write that data incrementally to avoid overloading the cluster's CPU and memory.
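For context, this is roughly the kind of setup I mean (paths, formats, and option values here are placeholders, not my actual job; I am assuming `cloudFiles.maxBytesPerTrigger` and `Trigger.AvailableNow` are the right knobs for rate-limiting a backfill):

```python
# Hypothetical sketch -- all paths and values are placeholders.
# cloudFiles.maxBytesPerTrigger caps how much data each micro-batch
# ingests, which is what I mean by reading "incrementally".
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxBytesPerTrigger", "10g")  # cap per-batch input size
    .load("s3://my-bucket/events/")                  # partitioned year=/month=/day=
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events_backfill")
    .trigger(availableNow=True)  # drain the backlog in rate-limited batches
    .toTable("bronze.events")
)
```

My understanding is that `Trigger.AvailableNow` respects the `maxBytesPerTrigger` / `maxFilesPerTrigger` limits while draining the backlog, whereas `Trigger.Once` ignores them, but please correct me if I have that wrong.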

When I run Auto Loader today using the setup above, I see in the Spark UI that it is trying to load the entire 1 TB S3 bucket into memory rather than reading it day by day (incrementally).

Do I have the backfill set up incorrectly, or am I missing something that would make Auto Loader backfill day by day first?

Thanks,

Avkash