Dooley
Databricks Employee

I would suggest letting Auto Loader figure out the best way to backfill all your files instead of trying to do the increments yourself with your own backfill scheme. In

upload_path = ('s3://test-data/calendar/2022/01/01/*'.format(year, month, day))

the problem I see is that there are no placeholders for the year, month, or day, so the .format() call has no effect. Maybe you mean this?

upload_path = ('s3://test-data/calendar/{}/{}/{}/*'.format(year, month, day))
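To make the difference concrete, here is a minimal sketch (plain Python, with made-up example values) showing that .format() silently does nothing when the string has no {} placeholders:

```python
# Example values only; in the real notebook these would come from your job parameters.
year, month, day = 2023, 2, 14

# No {} placeholders: .format() returns the string unchanged.
no_placeholders = 's3://test-data/calendar/2022/01/01/*'.format(year, month, day)

# With {} placeholders: the arguments are substituted in order.
with_placeholders = 's3://test-data/calendar/{}/{}/{}/*'.format(year, month, day)

print(no_placeholders)    # path is still hard-coded to 2022/01/01
print(with_placeholders)  # path now reflects year/month/day
```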

I mentioned that you should not repartition because, with the code I specified before, you will end up with a compacted Delta table made up of very large files. So the fact that you still have many small files after the auto-optimize step, the Auto Loader stream read, and the last checkpoint I mentioned is unusual. Did you run

DESCRIBE TABLE EXTENDED user_data_bronze_compact

and get the location? And did you then, in the next cell, run

%fs 
ls file_path_of_delta_table 

to see the size of the files? What are the sizes you are seeing?
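If it helps to sanity-check the same thing outside a notebook, here is a minimal local sketch of that file-size check. This is plain Python against a local directory, not the Databricks %fs magic, and summarize_file_sizes is a hypothetical helper name of my own:

```python
import os

def summarize_file_sizes(path):
    """Return (file_count, list_of_sizes_in_bytes) for the files directly under path.

    Many small sizes in the result would suggest the table was not compacted
    into large files as expected.
    """
    sizes = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            sizes.append(os.path.getsize(full))
    return len(sizes), sizes
```

On a Delta table directory you would point this at the location returned by DESCRIBE TABLE EXTENDED and look at how the sizes are distributed.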