Best approach for handling batch processes from cloud object storage

alexandrexixe · 09-11-2024

I'm working on a Databricks implementation project where external Kafka processes write JSON files to S3. I need to ingest these files daily, or in some cases every four hours, but I don't need to perform stream processing.

I'm considering two approaches to bring these files into Delta Lake in a Unity Catalog environment:

1 - Using Auto Loader in batch mode: I could run Auto Loader as a scheduled batch job to load these files directly into a Delta bronze layer.

2 - Creating external tables: I could create external tables over these files and use them as the bronze layer.
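
To make the two options concrete, here is a minimal sketch of what each could look like in PySpark on Databricks. Every path, bucket, and table name is a hypothetical placeholder, the helper function names are made up for illustration, and reusing the checkpoint path as the Auto Loader schema location is an assumption. Option 2 also assumes the S3 path is already registered as a Unity Catalog external location.

```python
# Sketch only: every path and table name below is a hypothetical placeholder.
SOURCE_PATH = "s3://example-bucket/kafka-json/"  # where the Kafka processes land files
CHECKPOINT_PATH = "s3://example-bucket/_checkpoints/bronze_events/"
BRONZE_TABLE = "main.bronze.events"  # catalog.schema.table in Unity Catalog


def autoloader_options(schema_location):
    """Auto Loader reader options for JSON input."""
    return {
        "cloudFiles.format": "json",
        # Auto Loader stores the inferred schema at this location.
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_new_files(spark, source_path, checkpoint_path, table_name):
    """Option 1: load files not yet recorded in the checkpoint, then stop."""
    reader = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(checkpoint_path))
        .load(source_path)
    )
    query = (
        reader.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process the current backlog, then terminate
        .toTable(table_name)
    )
    query.awaitTermination()


def external_table_ddl(table_name, source_path):
    """Option 2: DDL for an external table that reads the JSON files in place."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING JSON LOCATION '{source_path}'"
    )
```

In a job scheduled daily or every four hours, option 1 would be `ingest_new_files(spark, SOURCE_PATH, CHECKPOINT_PATH, BRONZE_TABLE)`, and option 2 would be a one-time `spark.sql(external_table_ddl("main.bronze.events_ext", SOURCE_PATH))`. With option 1 the checkpoint tracks which files have already been loaded, so each run picks up only new ones. With option 2 the table is not a Delta table and the JSON files are read again on every query.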

Do these approaches make sense?
What are the advantages and disadvantages of each?
Is there a better approach?
