Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I simplify my data ingestion by processing the data as it arrives in cloud storage?

User16835756816
Valued Contributor

This post will help you simplify your data ingestion using Auto Loader, Delta optimized writes, Databricks jobs, and Delta Live Tables.

Pre-Req: 

  • You are working with JSON data and writing to Delta tables

Step 1: Simplify ingestion with Auto Loader 

Delta Lake helps unlock the full capabilities of working with JSON data in Databricks. Auto Loader makes it easy to ingest JSON data and manage semi-structured data in the Databricks Lakehouse.
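As a rough sketch of what an Auto Loader stream looks like in PySpark, the snippet below reads JSON files from a storage path and writes them to a Delta table. The function names, paths, and table name are placeholders chosen for illustration, not taken from the notebook mentioned in this post, and the code assumes a Databricks runtime where the `cloudFiles` source and `trigger(availableNow=True)` are available.

```python
def autoloader_options(schema_location):
    """Reader options for an Auto Loader (cloudFiles) JSON stream."""
    return {
        # Tell Auto Loader the files are JSON.
        "cloudFiles.format": "json",
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def start_json_ingest(spark, source_path, target_table, checkpoint_path,
                      scheduled=True):
    """Start an Auto Loader stream from source_path into a Delta table.

    `spark` is the SparkSession a Databricks notebook provides.
    All paths and the table name are caller-supplied placeholders.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(checkpoint_path))
        .load(source_path)
    )
    writer = stream.writeStream.option("checkpointLocation", checkpoint_path)
    if scheduled:
        # Process the files that have arrived so far, then stop.
        # Suits a job that runs on a schedule.
        writer = writer.trigger(availableNow=True)
    # Without a trigger the stream keeps running and picks up new files
    # continuously.
    return writer.toTable(target_table)
```

In a notebook this might be called as `start_json_ingest(spark, "/mnt/raw/events", "events_bronze", "/mnt/checkpoints/events")`, where all three values are hypothetical. Pass `scheduled=False` for continuous ingestion.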

Get hands-on: import this notebook for a walkthrough of continuous and scheduled ingestion of JSON data with Auto Loader.

If you want to learn more, check out this overview blog and short video, and come back to this post to follow Steps 2-3.

Step 2: Reduce latency by optimizing your writes to Delta tables

Now that you're using Delta tables, reduce read latency by enabling Auto Optimize, which automatically compacts small files during individual writes.

Set the table properties delta.autoOptimize.optimizeWrite = true and delta.autoOptimize.autoCompact = true in the CREATE TABLE command.
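One way to apply the two properties above is to build the SQL statement in Python and pass it to `spark.sql`. This is a minimal sketch: the helper names, the table name, and the column list are illustrative assumptions, and the `ALTER TABLE` variant is included for tables that already exist.

```python
# The two Auto Optimize table properties named in this step.
AUTO_OPTIMIZE_PROPERTIES = {
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
}


def _properties_clause():
    """Render the properties as a TBLPROPERTIES argument list."""
    return ", ".join(f"{k} = {v}" for k, v in AUTO_OPTIMIZE_PROPERTIES.items())


def create_table_sql(table_name, columns_ddl):
    """CREATE TABLE statement for a Delta table with Auto Optimize enabled."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ({columns_ddl}) "
        f"USING DELTA TBLPROPERTIES ({_properties_clause()})"
    )


def alter_table_sql(table_name):
    """ALTER TABLE statement that enables Auto Optimize on an existing table."""
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ({_properties_clause()})"
```

For example, `spark.sql(create_table_sql("events_bronze", "id BIGINT, body STRING"))` would create a hypothetical `events_bronze` table with both properties set.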

Tip: Tables with many active queries and latency requirements on the order of minutes benefit most from Auto Optimize.

Find examples here for enabling Auto Optimize on all tables.
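If the linked examples are not at hand, one approach I believe the Databricks documentation describes is to set session-level defaults so that newly created Delta tables pick up both properties. Treat the sketch below as an assumption to check against the docs for your runtime. The helper name is hypothetical, and the defaults apply only to tables created in that Spark session.

```python
# Session defaults that apply the Auto Optimize properties to new Delta tables.
# Assumption: these keys match the Databricks docs for your runtime version.
NEW_TABLE_DEFAULTS = {
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true",
}


def enable_auto_optimize_defaults(spark):
    """Set the defaults on the given SparkSession and return the keys applied."""
    for key, value in NEW_TABLE_DEFAULTS.items():
        spark.conf.set(key, value)
    return sorted(NEW_TABLE_DEFAULTS)
```

Existing tables are not changed by these defaults. They still need the table properties set directly.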

Step 3: Set up automated ETL processing

Finally, use Databricks workflows and jobs to author, manage, and orchestrate ingestion of your semi-structured and streaming data.

Here's a quick walkthrough on How to Schedule a Job and Automate a Workload.
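For those who prefer to define the schedule in code rather than in the UI, the sketch below builds a request body of the kind the Jobs API 2.1 `jobs/create` endpoint accepts. The job name, notebook path, cluster ID, and hourly cron expression are all placeholder assumptions. The function only builds the payload; sending it with your workspace URL and token is left out.

```python
def scheduled_notebook_job(name, notebook_path, cluster_id,
                           cron="0 0 * * * ?", timezone="UTC"):
    """Build a job definition that runs one notebook on a cron schedule.

    The default cron expression is Quartz syntax for "at the top of every hour".
    """
    return {
        "name": name,
        "tasks": [
            {
                # Hypothetical task key; any unique name within the job works.
                "task_key": "ingest",
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
    }
```

Pairing a schedule like this with an Auto Loader stream that processes the available files and then stops gives you scheduled ingestion without a cluster running all the time.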

Did you know Databricks also provides powerful ETL capabilities with Delta Live Tables (DLT)? With DLT, treat your data as code and apply software engineering best practices like testing, monitoring and documentation to deploy reliable pipelines at scale.
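To give a feel for "data as code", here is a small hedged sketch of a DLT pipeline in Python: one table ingests raw JSON with Auto Loader, and a second keeps only the rows that pass a data-quality expectation. The table names, the `id` column, and the source path are invented for illustration. The `dlt` module can only be imported inside a Delta Live Tables pipeline, so the import sits inside the function.

```python
# Data-quality rules as name -> SQL condition. The `id` column is hypothetical.
EXPECTATIONS = {
    "has_id": "id IS NOT NULL",
}


def define_pipeline(spark, source_path):
    """Register two DLT tables. Call this from a DLT pipeline notebook."""
    import dlt  # only available inside a Delta Live Tables pipeline

    @dlt.table(comment="Raw JSON files ingested with Auto Loader.")
    def events_raw():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(source_path)
        )

    @dlt.table(comment="Events that pass the quality checks.")
    @dlt.expect_all_or_drop(EXPECTATIONS)  # drop rows that fail any rule
    def events_clean():
        return dlt.read_stream("events_raw")
```

Because the rules live in a plain dictionary, they can be reviewed, versioned, and tested like any other code.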

To learn more about DLT...

- Follow the DLT Getting Started Guide

- Watch a demo

- Download example notebooks

- Join the DLT discussions in the Databricks Community

Congrats, you have now optimized your data ingestion to get the most out of your data!

Drop your questions, feedback, and tips below!

1 REPLY 1

youssefmrini
Honored Contributor III

