Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best approach for handling batch processes from cloud object storage

alexandrexixe
New Contributor

I'm working on a Databricks implementation project where external Kafka processes write JSON files to S3. I need to ingest these files daily, or in some cases every four hours, but I don't need to perform stream processing.

I'm considering two approaches to bring these files into Delta Lake in a Unity Catalog environment:

1 - Using Auto Loader in batch mode: I could use Auto Loader in batch mode to load these files directly into a Delta bronze layer.

2 - Creating external tables: I could create external tables over these files and use them as the bronze layer.


Do these approaches make sense?
What are the advantages and disadvantages of each?
Is there a better approach?

1 REPLY

filipniziol
Contributor

Hi @alexandrexixe ,

Are you building a production solution, or do you simply want to explore the data?
For something long-term, I would recommend the Auto Loader option.
With external tables defined directly over the JSON files, you do not get the benefits of working with Delta tables: queries will be slow, and there will be no schema evolution, no time travel, etc.
Sooner or later you will need a feature that is not available with external tables.
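
For reference, here is a minimal sketch of what the Auto Loader option could look like when run as a scheduled batch job. It assumes a Databricks runtime where the `cloudFiles` source and the `availableNow` trigger are available. The function names, S3 paths, and table name are hypothetical placeholders, not taken from the project described above.

```python
def auto_loader_options(schema_location: str) -> dict:
    """Reader options for Auto Loader over JSON files (illustrative defaults)."""
    return {
        "cloudFiles.format": "json",
        # Auto Loader stores the inferred schema here and evolves it over time.
        "cloudFiles.schemaLocation": schema_location,
        # Infer real column types instead of reading every field as a string.
        "cloudFiles.inferColumnTypes": "true",
    }


def ingest_new_files(spark, source_path: str, target_table: str, checkpoint_path: str) -> None:
    """Load all files not yet processed into a Delta table, then stop."""
    query = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(f"{checkpoint_path}/schema"))
        .load(source_path)
        .writeStream
        # The checkpoint records which files have already been ingested.
        .option("checkpointLocation", checkpoint_path)
        # Allow new columns to be added to the target table.
        .option("mergeSchema", "true")
        # Process everything currently available, then terminate (batch-style run).
        .trigger(availableNow=True)
        .toTable(target_table)
    )
    query.awaitTermination()


# Hypothetical usage inside a Databricks job or notebook, where `spark` is predefined:
# ingest_new_files(
#     spark,
#     source_path="s3://my-bucket/kafka-json/",            # placeholder path
#     target_table="main.bronze.events",                   # placeholder Unity Catalog table
#     checkpoint_path="s3://my-bucket/_checkpoints/events" # placeholder path
# )
```

Scheduling this as a job daily or every four hours should give batch behavior without a continuously running stream, since each run picks up only the files that arrived after the previous run.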
