I'm working on a Databricks implementation project where external Kafka processes write JSON files to S3. I need to ingest these files daily, or in some cases every four hours, but I don't need to perform stream processing.
I'm considering two approaches to bring these files into Delta Lake in a Unity Catalog-enabled environment:
1 - Using Auto Loader in batch mode: I could run Auto Loader with a one-shot trigger to ingest these files directly into a Delta bronze layer.
2 - Creating external tables: I could create external tables over these files and use them as the bronze layer.
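To be concrete about option 1, this is roughly what I have in mind: Auto Loader (the `cloudFiles` source) run as a scheduled batch job with `trigger(availableNow=True)`, so it processes only new files and then stops. The bucket paths and table name below are placeholders, not my real ones:

```python
# Hypothetical Auto Loader options for ingesting the Kafka-produced JSON files.
# All paths and the target table name are placeholders.
autoloader_options = {
    "cloudFiles.format": "json",
    # Schema tracking location required by Auto Loader for schema inference/evolution:
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/bronze_events",
    "cloudFiles.inferColumnTypes": "true",
}

# On Databricks, the batch-style ingestion would look like:
# (spark.readStream
#      .format("cloudFiles")
#      .options(**autoloader_options)
#      .load("s3://my-bucket/kafka-json/")
#      .writeStream
#      .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
#      .trigger(availableNow=True)   # process pending files once, then stop
#      .toTable("main.bronze.events"))
```

The idea is that even though the API is `readStream`/`writeStream`, the `availableNow` trigger makes it behave like an incremental batch job I can schedule daily or every four hours.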
Do these approaches make sense?
What are the advantages and disadvantages of each?
Is there any other, better approach?