<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Seeking Advice on Data Lakehouse Architecture with Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/seeking-advice-on-data-lakehouse-architecture-with-databricks/m-p/90863#M38010</link>
    <description>&lt;P&gt;I'm currently designing a data lakehouse architecture using Databricks and have a few questions. What are the best practices for efficiently ingesting both batch and streaming data into Delta Lake? Any recommended tools or approaches?&lt;/P&gt;</description>
    <pubDate>Wed, 18 Sep 2024 10:19:12 GMT</pubDate>
    <dc:creator>joshbuttler</dc:creator>
    <dc:date>2024-09-18T10:19:12Z</dc:date>
    <item>
      <title>Seeking Advice on Data Lakehouse Architecture with Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/seeking-advice-on-data-lakehouse-architecture-with-databricks/m-p/90863#M38010</link>
      <description>&lt;P&gt;I'm currently designing a data lakehouse architecture using Databricks and have a few questions. What are the best practices for efficiently ingesting both batch and streaming data into Delta Lake? Any recommended tools or approaches?&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 10:19:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/seeking-advice-on-data-lakehouse-architecture-with-databricks/m-p/90863#M38010</guid>
      <dc:creator>joshbuttler</dc:creator>
      <dc:date>2024-09-18T10:19:12Z</dc:date>
    </item>
    <item>
      <title>Re: Seeking Advice on Data Lakehouse Architecture with Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/seeking-advice-on-data-lakehouse-architecture-with-databricks/m-p/90867#M38012</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/121430"&gt;@joshbuttler&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I think the best way is to use auto loader, which&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;provides a highly efficient way to incrementally process new data, while also guaranteeing each file is processed exactly once.&lt;BR /&gt;It supports ingestion in a batch mode (&lt;STRONG&gt;Trigger.AvailableNow()&lt;/STRONG&gt;) and you can also load data in streaming manner (under the hood it's using spark structured streaming). You have native support for variety of source files like JSON, PARQUET, CSV, XML to name a few&amp;nbsp; and also integration with streaming data sources like Kafka, Kinesis or EventHub.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/" target="_blank" rel="noopener"&gt;What is Auto Loader? - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;</description>
      <pubDate>Wed, 18 Sep 2024 11:02:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/seeking-advice-on-data-lakehouse-architecture-with-databricks/m-p/90867#M38012</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-09-18T11:02:15Z</dc:date>
    </item>
  </channel>
</rss>

