Best approach for handling batch processes from cloud object storage
09-11-2024 08:35 AM
I'm working on a Databricks implementation project where external Kafka processes write JSON files to S3. I need to ingest these files daily, or in some cases every four hours, but I don't need to perform stream processing.
I'm considering two approaches to bring these files into a Delta Lake in a Unity Catalog environment:
1. Using Auto Loader in batch mode: I could use Auto Loader in batch mode to bring these files directly into a Delta bronze layer.
2. Creating external tables: I could create external tables over these files and use them as the bronze layer.
Do these approaches make sense?
What are the advantages and disadvantages of each?
Is there any other better approach?
09-11-2024 12:26 PM
Hi @alexandrexixe ,
Are you building a production solution, or do you simply want to explore the data?
For something long-term, I would recommend the Auto Loader option.
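A minimal sketch of that pattern, assuming placeholder S3 paths and a hypothetical three-level Unity Catalog table name (adjust all of these to your environment):

```python
# Auto Loader in batch mode: trigger(availableNow=True) processes all files
# that arrived since the last run, then stops -- a good fit for a daily or
# 4-hourly scheduled job. Paths and table name below are placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks the inferred schema (and its evolution) here
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/bronze_events")
    .load("s3://my-bucket/kafka-output/")
)

(
    df.writeStream
    # The checkpoint gives you exactly-once ingestion across runs
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
    .trigger(availableNow=True)  # run as a batch job, then exit
    .toTable("main.bronze.kafka_events")  # Unity Catalog three-level name
)
```

Scheduled via a Databricks job, this gives you incremental, exactly-once ingestion of only the new files, without keeping a cluster running between loads.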
With external tables over raw files you don't get the benefits of Delta tables: queries will be slow, there is no schema evolution, no time travel, and so on. Sooner or later you will hit the one feature you need that simply isn't available on external tables.
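For comparison, the external-table route would be roughly the following (names and paths are hypothetical; in Unity Catalog this also assumes an external location is already configured for the bucket):

```python
# External table defined directly over the raw JSON files: there is no Delta
# transaction log, so no time travel or schema evolution, and every query
# re-reads and re-parses the JSON files from S3.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.bronze.kafka_events_ext
    USING JSON
    LOCATION 's3://my-bucket/kafka-output/'
""")
```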