06-10-2022 01:26 PM
- I would suggest not writing the multitude of small Parquet files to S3, since performance will be horrible compared to writing the Delta-format version of the same data, which has fewer, larger files. In our example that was called user_data_bronze_compact. I would not suggest partitioning any table smaller than 1 TB, or having any partition smaller than 1 GB, for performance reasons. Your write to S3 will be more efficient with the compact version of the table. You can try writing using foreachBatch() or foreach().
- Then take that bronze DataFrame and use the trigger once option, so the stream processes the data that is available and then stops. See the Triggers documentation.
- Auto Loader can run backfills at a regular interval using the "cloudFiles.backfillInterval" option.
- You can find the location of the table by running DESCRIBE TABLE EXTENDED user_data_bronze_compact; near the bottom of the output there is a "Location" row. You can then see the files that make up the table using %fs ls file_path_you_grabbed_from_describe_table_extended_step
- You can turn on auto optimize before you start the stream and skip the middle checkpoint.
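To make the steps above a bit more concrete, here is a rough PySpark sketch, not a tested pipeline. The helper names, the paths you pass in, and the "1 day" backfill interval are all hypothetical; only user_data_bronze_compact comes from this thread. It assumes a Databricks notebook where `spark` already exists and the table has already been created.

```python
# Sketch only: helper names, paths, and the default interval are hypothetical.
BRONZE_TABLE = "user_data_bronze_compact"  # table name used in this thread


def autoloader_options(file_format, schema_location, backfill_interval="1 day"):
    """Build Auto Loader options, including a periodic backfill."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.backfillInterval": backfill_interval,
    }


def auto_optimize_sql(table_name):
    """SQL that turns on Delta auto optimize for an existing table."""
    return (
        f"ALTER TABLE {table_name} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


def table_location(describe_rows):
    """Pick the 'Location' value out of DESCRIBE TABLE EXTENDED rows.

    Each row is (col_name, data_type, comment); the path sits in data_type.
    """
    for row in describe_rows:
        if row[0] == "Location":
            return row[1]
    return None


def export_bronze_once(spark, target_path, checkpoint_path):
    """Stream the compact bronze table to S3 as a single triggered run."""

    def write_batch(batch_df, batch_id):
        # Each micro-batch is a regular DataFrame, so the batch writer works.
        batch_df.write.mode("append").parquet(target_path)

    return (
        spark.readStream.table(BRONZE_TABLE)
        .writeStream.foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # process what is available, then stop
        .start()
    )
```

In a notebook you might run `spark.sql(auto_optimize_sql(BRONZE_TABLE))` before starting the stream, pass `autoloader_options(...)` to `spark.readStream.format("cloudFiles").options(**opts)` on the ingest side, and feed `spark.sql("DESCRIBE TABLE EXTENDED user_data_bronze_compact").collect()` into `table_location` to get the path for `%fs ls`.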