06-10-2022 09:04 AM
Hi there,
Thank you for the advice!
I set up the Autoloader script this morning in a notebook, and it appears to transfer files over fairly quickly. I added .option("cloudFiles.inferColumnTypes", "true") in order to detect the schema.
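For reference, here is roughly how the notebook cell is set up. The bucket and schema-location paths below are placeholders rather than our real ones, and the options are collected in a dict just for readability:

```python
# Auto Loader settings used in the notebook (sketch; paths are placeholders).
autoloader_options = {
    "cloudFiles.format": "json",
    # Infer column types instead of reading everything as strings:
    "cloudFiles.inferColumnTypes": "true",
    # Schema tracking location (required for schema inference; placeholder path):
    "cloudFiles.schemaLocation": "s3://example-bucket/_schemas/calendar",
}

def build_autoloader_stream(spark):
    """Build the streaming reader; called inside the Databricks notebook."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in autoloader_options.items():
        reader = reader.option(key, value)
    # Source folder of GZ JSON files (placeholder path):
    return reader.load("s3://example-bucket/calendar/")
```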
Questions:
- How do we save user_data_bronze_not_compact to an S3 path, partitioned by (yyyy/mm/dd)?
- How can we set it up so the Autoloader job triggers only once and stops when it has loaded all of the data in the S3 folder?
- We want to run Autoloader once a day to process the previous day's data. At the moment we use upload_path = ('s3://test-data/calendar/{}/{}/{}'.format(year, month, day)) to load data for a specific day. Is there a better way to do this with Autoloader? Backfill? Incremental?
- In this query we load data from GZ JSON into Delta and store it in a table (not optimized). Since we do not specify the table's location in S3, where is this table stored?
- When we optimize this data and store it in S3, we re-write it again, so in essence we now have 3 copies of this data, right? If so, do we need to run step 3, or can we optimize at step 2?
- JSON
- JSON TO DELTA (Not optimized)
- DELTA to Optimized Delta (Optimized)
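For context on the once-a-day question, this is roughly how we compute the previous day's folder today. It assumes zero-padded month/day folder names, and the helper name is just for this example:

```python
from datetime import date, timedelta

def daily_upload_path(run_date):
    """Return the S3 folder for the day before run_date (example helper)."""
    target = run_date - timedelta(days=1)  # previous day's data
    return "s3://test-data/calendar/{}/{:02d}/{:02d}".format(
        target.year, target.month, target.day)
```

So a job running on 2022-06-10 would read from s3://test-data/calendar/2022/06/09.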
Thank you for the help!
Kash