Best Practices for Daily Source-to-Bronze Data Ingestion in Databricks

JissMathew — Wed, 04 Dec 2024 06:46:16 GMT

How can we effectively manage source-to-bronze data ingestion from a project perspective, particularly when considering daily scheduling strategies using either Auto Loader or Serverless Warehouse COPY INTO commands?

Re: Best Practices for Daily Source-to-Bronze Data Ingestion in Databricks

Louis_Frolio — Wed, 04 Dec 2024 12:21:09 GMT

JissMathew, you have several options when ingesting data from cloud storage into Delta Lake.

Your first approach should be Delta Live Tables (DLT), which will be the easiest and most cost-effective option. Among other benefits, it manages the infrastructure for you, so you don't have to think about which type of compute resources to use.

Another option is to create Structured Streaming jobs and schedule them using Workflows. As with DLT, you can leverage Auto Loader to pull files from cloud storage very efficiently.

DLT is the best way forward though.
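
For the Structured Streaming option, here is a minimal sketch of what a daily Auto Loader job might look like. It assumes a Databricks runtime where the `cloudFiles` source is available and a `spark` session is passed in. The function names, paths, and table name are hypothetical and only for illustration. The `availableNow` trigger processes whatever files have arrived since the last run and then stops, which fits a once-a-day Workflows schedule.

```python
from typing import Any, Dict


def autoloader_options(file_format: str, schema_location: str) -> Dict[str, str]:
    """Build the Auto Loader (cloudFiles) reader options.

    schema_location is where Auto Loader keeps the inferred schema between runs.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_to_bronze(
    spark: Any,
    source_path: str,
    target_table: str,
    checkpoint_path: str,
    file_format: str = "json",
):
    """Load files not yet processed into a bronze Delta table, then stop.

    Assumption: `spark` is the SparkSession that Databricks provides.
    The checkpoint path is reused as the schema location for simplicity.
    """
    reader = spark.readStream.format("cloudFiles").options(
        **autoloader_options(file_format, checkpoint_path)
    )
    return (
        reader.load(source_path)
        .writeStream
        # The checkpoint records which files were already ingested.
        .option("checkpointLocation", checkpoint_path)
        # Process the current backlog and exit instead of running continuously.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook or job task this could be called as `ingest_to_bronze(spark, "<source path>", "<catalog.schema.table>", "<checkpoint path>")`, with the placeholders replaced by your own locations.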

Cheers, Louis.

Re: Best Practices for Daily Source-to-Bronze Data Ingestion in Databricks

JissMathew — Wed, 04 Dec 2024 16:57:41 GMT

@Louis_Frolio thank you for your reply. To implement DLT, do we need a multi-node cluster?

Re: Best Practices for Daily Source-to-Bronze Data Ingestion in Databricks

Louis_Frolio — Wed, 04 Dec 2024 18:06:51 GMT

No, it is not a strict requirement. You can have a single-node job cluster run the job if the job is small.
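
As a rough illustration, a single-node job cluster definition for a Workflows job might be built like the sketch below. This is an assumption-laden example, not a definitive spec: the helper name is hypothetical, the runtime version and node type are left as parameters for you to fill in, and the exact `spark_conf` values should be checked against the current Databricks clusters documentation.

```python
def single_node_job_cluster(spark_version: str, node_type_id: str) -> dict:
    """Return a job cluster spec with a driver only and no workers.

    Assumption: this follows the single-node pattern for the Databricks
    clusters API (zero workers plus the singleNode profile).
    """
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Zero workers means everything runs on the driver node.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }
```

The returned dictionary would go in the `new_cluster` field of a job task. If the daily volume grows, switching back to a multi-node cluster only means changing this spec, not the ingestion code.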