This post will help you simplify your data ingestion using Auto Loader, Delta optimized writes, Databricks Jobs, and Delta Live Tables.
Pre-Req:
- You are working with JSON data and Delta write commands
Step 1: Simplify ingestion with Auto Loader
Delta Lake helps unlock the full capabilities of working with JSON data in Databricks. Auto Loader makes it easy to ingest JSON data and manage semi-structured data in the Databricks Lakehouse.
Get hands on and import this notebook for a walkthrough on continuous and scheduled ingest of JSON data with Auto Loader.
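If you want a feel for the code before opening the notebook, here is a minimal sketch of a JSON ingest with Auto Loader. The paths and table name below are placeholders, not taken from the linked notebook:

```python
# Minimal Auto Loader sketch: stream JSON files from a cloud storage path into a
# Delta table. All paths and the table name are placeholders.
raw_events = (
    spark.readStream
        .format("cloudFiles")                                         # Auto Loader source
        .option("cloudFiles.format", "json")                          # ingest JSON files
        .option("cloudFiles.schemaLocation", "/mnt/schemas/events")   # track and evolve the schema
        .load("/mnt/raw/events")
)

(
    raw_events.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .trigger(availableNow=True)   # process available files then stop; omit for continuous ingest
        .toTable("bronze_events")
)
```

Dropping the `trigger(availableNow=True)` line keeps the stream running continuously, which is the other mode the notebook walks through.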
If you want to learn more, check out this overview blog and short video, and come back to this post to follow Steps 2-3.
Step 2: Reduce latency by optimizing your writes to Delta tables
Now that you're using Delta tables, reduce read latency by enabling Auto Optimize, which automatically compacts small files during individual writes.
Set the table properties
delta.autoOptimize.optimizeWrite = true and delta.autoOptimize.autoCompact = true
in the CREATE TABLE command.
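For example, both properties can be set when the table is created, or added to an existing table with ALTER TABLE. This is a sketch; the table name and columns are placeholders:

```python
# Sketch: create a Delta table with Auto Optimize enabled.
# The table name and columns are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        id BIGINT,
        payload STRING
    )
    TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# The same properties can be added to an existing table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```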
Tip: Tables with many active queries and latency requirements on the order of minutes benefit most from Auto Optimize.
Find examples here for enabling Auto Optimize on all tables.
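If you want every new table to pick these properties up automatically, Databricks also exposes session-level defaults. A sketch; double-check the config names against the linked examples:

```python
# Sketch: default Auto Optimize on for all new Delta tables created in this session.
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")
```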
Step 3: Set up automated ETL processing
Finally, use Databricks workflows and jobs to author, manage, and orchestrate ingestion of your semi-structured and streaming data.
Here's a quick walkthrough on How to Schedule a Job and Automate a Workload.
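If you prefer defining jobs programmatically rather than through the UI, a job definition along these lines can be submitted to the Jobs API. This is only a sketch assuming the Jobs API 2.1 payload shape; the job name, notebook path, cluster ID, and schedule are all placeholders:

```python
# Sketch of a Jobs API 2.1-style job definition that runs the ingestion notebook hourly.
# Every value here is a placeholder; submit it with the Databricks CLI, SDK, or REST API.
job_config = {
    "name": "ingest-json-with-autoloader",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Users/you@example.com/autoloader_ingest"},
            "existing_cluster_id": "<your-cluster-id>",
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour
        "timezone_id": "UTC",
    },
}
```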
Did you know Databricks also provides powerful ETL capabilities with Delta Live Tables (DLT)? With DLT, treat your data as code and apply software engineering best practices like testing, monitoring and documentation to deploy reliable pipelines at scale.
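As a taste of what "data as code" looks like, here is a minimal DLT sketch that declares a bronze table fed by Auto Loader and a silver table guarded by a data quality expectation. The paths, table names, and columns are placeholders:

```python
import dlt
from pyspark.sql.functions import col

# Sketch of a two-table DLT pipeline. Paths, table names, and columns are placeholders.

@dlt.table(comment="Raw JSON events ingested with Auto Loader")
def bronze_events():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/events")
    )

@dlt.table(comment="Events with a non-null id")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")   # drop rows that fail the expectation
def silver_events():
    return dlt.read_stream("bronze_events").select(col("id"), col("payload"))
```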
To learn more about DLT...
- Follow the DLT Getting Started Guide
- Watch a demo
- Download example notebooks
- Join the DLT discussions in the Databricks Community
Congrats, you have now optimized your data ingestion to get the most out of your data!
Drop your questions, feedback, and tips below!