Dooley
Databricks Employee
  1. I would suggest not writing the multitude of small Parquet files to S3, since performance will be horrible compared to writing the Delta-format version of that same data with fewer, larger files; in our example that was called user_data_bronze_compact. I would not suggest partitioning any table smaller than 1 TB, and for performance reasons no partition should be smaller than 1 GB. Your write to S3 will be more efficient with the compact version of the table. You can try writing using foreachBatch() or foreach().
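A minimal sketch of the foreachBatch() approach, assuming a Databricks notebook where `spark` is already defined; the bucket path and coalesce factor are illustrative assumptions, not values from this thread:

```python
# Hedged sketch: write each micro-batch of the compacted bronze table to S3
# as Delta. Path and parallelism are assumptions for illustration only.
def write_batch_to_s3(batch_df, batch_id):
    # coalesce to limit the number of output files per micro-batch
    (batch_df.coalesce(8)
        .write
        .format("delta")
        .mode("append")
        .save("s3://my-bucket/user_data_bronze_compact"))  # hypothetical path

(spark.readStream
    .table("user_data_bronze_compact")
    .writeStream
    .foreachBatch(write_batch_to_s3)
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_compact")
    .start())
```

foreachBatch() gives you the full batch DataFrame API per micro-batch, which is why it is usually preferred over foreach() for sinks like this.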
  2. Then take that bronze DataFrame and use the trigger-once option. See Triggers here.
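A sketch of the trigger-once option on a notebook cluster, with the checkpoint and output paths as assumed placeholders; on newer runtimes trigger(availableNow=True) is the recommended successor to trigger(once=True):

```python
# Hedged sketch: process everything available, then stop (one-shot batch run).
(spark.readStream
    .table("user_data_bronze_compact")  # table name taken from the example above
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/once")  # hypothetical
    .trigger(once=True)
    .start("s3://my-bucket/user_data_bronze_compact"))  # hypothetical path
```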
  3. Auto Loader can backfill on an interval using the "cloudFiles.backfillInterval" option.
  4. You can find the location of the table with DESCRIBE TABLE EXTENDED user_data_bronze_compact; at the bottom it says "Location". You can then see the files that make up that table using %fs ls with the path you grabbed from the DESCRIBE TABLE EXTENDED step.
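If you want the location programmatically rather than copying it from the DESCRIBE TABLE EXTENDED output, a sketch using DESCRIBE DETAIL (a Delta command that returns the location as a column) in a Databricks notebook, where `spark`, `display`, and `dbutils` are provided:

```python
# Hedged sketch: fetch the table's storage location, then list its files.
detail = spark.sql("DESCRIBE DETAIL user_data_bronze_compact").collect()[0]
table_path = detail["location"]

# dbutils.fs.ls is the programmatic equivalent of %fs ls <path>
display(dbutils.fs.ls(table_path))
```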
  5. You can do the turn-on-auto-optimize step before you start the stream and skip the middle checkpoint.
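A sketch of enabling auto optimize on the table before the stream starts, run from a notebook where `spark` exists; these are the standard Databricks Delta table properties for optimized writes and auto compaction:

```python
# Hedged sketch: set auto-optimize properties once, before starting the stream,
# so the streamed writes come out as fewer, larger files.
spark.sql("""
  ALTER TABLE user_data_bronze_compact SET TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true
  )
""")
```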