06-14-2022 10:57 AM
Hi there,
I was able to set up a notebook, but I'm having difficulty getting it to backfill incrementally, day by day.
I'd like to process all of our 2022 data, which is stored in a year/month/day layout, incrementally to reduce the load. I have a notebook set up that iterates between a start_date_param and an end_date_param so it runs one job per day to backfill.
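For context, the iteration is roughly the following (the bucket and prefix match our actual layout; the helper function and its name are just a simplified sketch of what the notebook does):

```python
from datetime import date, timedelta

def daily_prefixes(start_date, end_date, base='s3://test-data/calendar'):
    """Yield one Auto Loader input path per day between start and end (inclusive)."""
    d = start_date
    while d <= end_date:
        # Matches the year/month/day layout, e.g. s3://test-data/calendar/2022/01/01/*
        yield '{}/{:04d}/{:02d}/{:02d}/*'.format(base, d.year, d.month, d.day)
        d += timedelta(days=1)

# One job per day between start_date_param and end_date_param
for upload_path in daily_prefixes(date(2022, 1, 1), date(2022, 1, 3)):
    print(upload_path)  # the notebook kicks off an Auto Loader run for this path
```
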
When I use a specific upload path in Auto Loader, for example:

upload_path = 's3://test-data/calendar/{}/{}/{}/*'.format(year, month, day)  # e.g. .../2022/01/01/*

I get this error:

java.lang.IllegalStateException: Found mismatched event: key calendar/2022/01/02/00-11-6a088a39-4180-4efe-852d-11d09e6c2eb8.json.gz doesn't have the prefix: calendar/2022/01/01/

When I don't specify the year/month/day, Auto Loader tries to load the entire 2022 directory at once rather than incrementally; in the Spark UI I can see it trying to load 49K files.
How do we set it up so that it loads the data for the first day, writes it partitioned by day, and then moves on to the next day?
I saw that you mentioned we should not partition by year/month/day since that slows down reads, but then our S3 directory will have tons of small files.
Lastly, how do we set it up to write files of roughly 1 GB, rather than the 10 MB chunks it's producing now with auto-optimize and auto-compact?
I've also set .option('cloudFiles.backfillInterval', '1 day') \
and also tried .option('cloudFiles.backfillInterval', 1) \
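For reference, the full read looks roughly like this (the schema location and format are placeholders, not our real values; only the backfillInterval line is exactly what I set):

```python
# Sketch of the Auto Loader read as set up in the notebook.
# 'spark' is the notebook's SparkSession; paths are placeholders.
df = (spark.readStream
      .format('cloudFiles')
      .option('cloudFiles.format', 'json')
      # The option I've been experimenting with:
      .option('cloudFiles.backfillInterval', '1 day')
      .option('cloudFiles.schemaLocation', '<schema-location>')
      .load(upload_path))
```
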
Any thoughts?
Thank you again for your help!
Avkash