Kash
Contributor III

Hi there,

Thank you for the advice!

I set up the Autoloader script this morning in a notebook and it appears to transfer files over fairly quickly. I added .option("cloudFiles.inferColumnTypes", "true") in order to detect the schema.
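For reference, the reader options look roughly like this (values copied from the notebook; the schemaLocation path below is a hypothetical example, not our real bucket layout):

```python
# Sketch of the Autoloader reader options set up this morning.
# The schemaLocation path is a hypothetical placeholder.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.inferColumnTypes": "true",  # detect column types from the data
    "cloudFiles.schemaLocation": "s3://test-data/_schemas/calendar",  # hypothetical
}

# On a Databricks cluster this dict would feed the stream reader:
# df = (spark.readStream
#         .format("cloudFiles")
#         .options(**autoloader_options)
#         .load("s3://test-data/calendar/"))
```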

Questions:

  1. How do we save user_data_bronze_not_compact to an S3 path, partitioned by date (yyyy/mm/dd)?
  2. How can we set it up so the Autoloader job triggers only once and stops after it has loaded all of the data in the S3 folder?
  3. We want to run Autoloader once a day to process the previous day’s data. At the moment we use upload_path = ('s3://test-data/calendar/{}/{}/{}'.format(year, month, day)) to load data for a specific day. Is there a better way to do this with Autoloader? Backfill? Incremental?
  4. In this query we load data from gzipped JSON into Delta and store it in a table (not optimized). Since we do not specify the table's location in S3, where is this table stored?
  5. When we optimize this data and store it in S3, we rewrite it again, so in essence we now have three copies of this data, right? If so, do we need to run step 3, or can we optimize step 2 instead?
    1. JSON 
    2. JSON to Delta (not optimized)
    3. Delta to optimized Delta (optimized)
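For context on question 3, the current day-by-day path logic looks roughly like this (a sketch: the fixed example date and the zero-padding are my additions for illustration; the real job would use date.today()):

```python
from datetime import date, timedelta

# Current approach: build yesterday's prefix and point the daily load at it.
# A fixed date is used here for illustration; the job would use date.today().
run_date = date(2023, 5, 2)
yesterday = run_date - timedelta(days=1)

# Zero-padded yyyy/mm/dd is an assumption about the bucket layout.
upload_path = 's3://test-data/calendar/{:04d}/{:02d}/{:02d}'.format(
    yesterday.year, yesterday.month, yesterday.day
)
# → 's3://test-data/calendar/2023/05/01'
```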

Thank you for the help!

Kash