Best approach for handling batch processes from cloud object storage.

alexandrexixe — Wed, 11 Sep 2024 15:35:02 GMT

I'm working on a Databricks implementation project where external Kafka processes write JSON files to S3. I need to ingest these files daily, or in some cases every four hours, but I don't need to perform stream processing.

I'm considering two approaches to bring these files into Delta Lake in a Unity Catalog environment:

1 - Using Auto Loader in batch mode: I could use Auto Loader in batch mode to load these files directly into a Delta bronze layer.

2 - Creating external tables: I could create external tables over these files and use them as the bronze layer.

Do these approaches make sense?
What are the advantages and disadvantages of each?
Is there a better approach?

Re: Best approach for handling batch processes from cloud object storage.

filipniziol — Wed, 11 Sep 2024 19:26:10 GMT

Hi @alexandrexixe,

Are you building a production solution, or do you simply want to explore the data?
For anything long-term I would recommend the Auto Loader option.
With external tables defined directly over the JSON files you do not get the benefits of working with Delta tables: queries will be slow, there will be no schema evolution, you won't have time travel, and so on.
Sooner or later you will need a feature that is simply not available with external tables.
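
To make the Auto Loader option more concrete, here is a minimal sketch of what "batch mode" could look like in PySpark: the stream is started with an availableNow trigger, so it processes whatever files have arrived since the last run and then stops. The function names, S3 paths, and table name are hypothetical placeholders, and the options shown are just one reasonable starting point, not a definitive setup.

```python
def autoloader_options(schema_location: str) -> dict:
    """Build Auto Loader (cloudFiles) reader options for JSON input.

    schema_location is a hypothetical path where Auto Loader keeps the
    inferred schema between runs.
    """
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }


def ingest_json_to_bronze(spark, source_path, schema_location,
                          checkpoint_location, target_table):
    """Load all new JSON files into a Delta bronze table, then stop.

    All paths and the table name are assumptions chosen for illustration.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_location))
        .load(source_path)
    )
    query = (
        stream.writeStream
        # The checkpoint records which files were already ingested.
        .option("checkpointLocation", checkpoint_location)
        # Allow new columns to be added to the Delta table.
        .option("mergeSchema", "true")
        # Process everything currently available, then shut down.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
    query.awaitTermination()
    return query


# Example call from a scheduled job (daily or every four hours), where
# `spark` is the session Databricks provides in notebooks and jobs:
# ingest_json_to_bronze(
#     spark,
#     source_path="s3://example-bucket/kafka-json/",
#     schema_location="s3://example-bucket/_schemas/events",
#     checkpoint_location="s3://example-bucket/_checkpoints/events",
#     target_table="main.bronze.events",
# )
```

Because the checkpoint tracks which files have been ingested, each scheduled run only picks up new files, so you get incremental loading without keeping a cluster running continuously.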
