
How can I simplify my data ingestion by processing the data as it arrives in cloud storage?

User16835756816
Valued Contributor

This post will help you simplify your data ingestion by utilizing Auto Loader, Delta Optimized Writes, Delta Write Jobs, and Delta Live Tables.

Pre-Req: 

  • You are working with JSON data and Delta write commands

Step 1: Simplify ingestion with Auto Loader 

Delta Lake helps unlock the full capabilities of working with JSON data in Databricks. Auto Loader makes it easy to ingest JSON data and manage semi-structured data in the Databricks Lakehouse.

Get hands on and import this notebook for a walkthrough on continuous and scheduled ingest of JSON data with Auto Loader.
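To give a feel for the code in that notebook, here is a minimal PySpark sketch of Auto Loader ingesting JSON into a Delta table. The paths and the table name are placeholders you would replace with your own:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new JSON files as they land in cloud storage.
# cloudFiles.schemaLocation lets Auto Loader track and evolve the inferred schema.
raw_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/events")
    .load("/mnt/landing/events/")
)

# Write to a Delta table. trigger(availableNow=True) processes whatever has arrived
# and then stops, which suits a scheduled job; drop it for continuous ingestion.
(
    raw_events.writeStream
    .option("checkpointLocation", "/mnt/landing/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze_events")
)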

If you want to learn more, check out this overview blog and short video, and come back to this post to follow Steps 2-3.

Step 2: Reduce latency by optimizing your writes to Delta tables

Now that you're using Delta tables, reduce read latency by enabling Auto Optimize to automatically compact small files during individual writes.

Set your table's properties to

delta.autoOptimize.optimizeWrite = true and delta.autoOptimize.autoCompact = true

in the CREATE TABLE command.
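As a rough sketch (the table name and columns below are just placeholders), the properties can be set when the table is created, or switched on later for an existing table with ALTER TABLE:

spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze_events (
        id STRING,
        event_time TIMESTAMP,
        payload STRING
    )
    USING DELTA
    TBLPROPERTIES (
        delta.autoOptimize.optimizeWrite = true,
        delta.autoOptimize.autoCompact = true
    )
""")

# For a table that already exists, enable the same properties after the fact.
spark.sql("""
    ALTER TABLE bronze_events
    SET TBLPROPERTIES (
        delta.autoOptimize.optimizeWrite = true,
        delta.autoOptimize.autoCompact = true
    )
""")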

Tip: Tables with many active queries and latency requirements on the order of minutes benefit most from Auto Optimize.

Find examples here for enabling Auto Optimize on all tables.

Step 3: Set up automated ETL processing

Finally, use Databricks workflows and jobs to author, manage, and orchestrate ingestion of your semi-structured and streaming data.

Here's a quick walkthrough on How to Schedule a Job and Automate a Workload.
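If you prefer to create the job programmatically rather than through the UI, here is an illustrative sketch using the Jobs REST API (2.1). The workspace URL, token, notebook path, cluster ID, and schedule are all placeholder values; in practice, keep the token in a secret scope rather than in code:

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

job_spec = {
    "name": "scheduled-json-ingest",
    "tasks": [
        {
            "task_key": "ingest_json",
            "notebook_task": {"notebook_path": "/Repos/ingestion/autoloader_bronze"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    # Quartz cron expression: run every day at 02:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # response contains the new job_id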

Did you know Databricks also provides powerful ETL capabilities with Delta Live Tables (DLT)? With DLT, treat your data as code and apply software engineering best practices like testing, monitoring and documentation to deploy reliable pipelines at scale.
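To give a taste of what that looks like, here is a small, hypothetical DLT pipeline in Python. It only runs inside a DLT pipeline (where the spark session is provided for you), and the path, table, and column names are placeholders:

import dlt

# Bronze: continuously ingest raw JSON with Auto Loader.
@dlt.table(comment="Raw JSON events ingested with Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")
    )

# Silver: declare a data quality expectation so bad rows are dropped and tracked.
@dlt.table(comment="Events with a non-null id")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def silver_events():
    return dlt.read_stream("bronze_events").select("id", "event_time", "payload")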

To learn more about DLT...

- Follow the DLT Getting Started Guide

- Watch a demo

- Download example notebooks

- Join the DLT discussions in the Databricks Community

Congrats, you have now optimized your data ingestion to get the most out of your data!

Drop your questions, feedback, and tips below!
