Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

AutoLoader Pros/Cons When Extracting Data

ChristianRRL
Valued Contributor III

Hi there, I am interested in using AutoLoader, but I'd like to get a bit of clarity on whether it makes sense in my case. Based on examples I've seen, an ideal use case for AutoLoader is when we have some kind of landing path where we expect raw files to arrive (CSV/JSON/XML/etc.), and AutoLoader can effectively scan for new files only and then append those to a raw table.

In the case where we would need to take it upon ourselves to extract the data from APIs (i.e. the data is not yet available as raw files), would there be any point/reason in "landing" the data first prior to using AutoLoader to load it into the respective raw tables? Why not just load the data directly into raw Delta tables at the time of extraction?

I can easily see data duplication being a con and a potential reason to skip the landing step altogether, but are there any benefits I might be missing to landing the raw data first before loading it into the raw tables? I would greatly appreciate some feedback on this!

Below is sample code highlighting this potential option of skipping landing + AutoLoader:

import json
import pandas as pd
from pyspark.sql import SparkSession

# Assuming response.text contains the JSON payload returned by the API call
data = json.loads(response.text)
df = pd.DataFrame(data)

# In a Databricks notebook `spark` already exists; getOrCreate() simply reuses it
spark = SparkSession.builder.appName("SaveToDelta").getOrCreate()

# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(df)

# Write to Delta table
spark_df.write.format("delta").mode("overwrite").saveAsTable("your_delta_table_name")
# Or to a specific path:
# spark_df.write.format("delta").mode("overwrite").save("/path/to/your/delta_table")
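
For comparison, here is a rough sketch of what the extraction side of the "land first" alternative could look like. The landing path, file naming scheme, and use of dbutils.fs.put are placeholder assumptions for illustration, not anything prescribed here:

import uuid
from datetime import datetime, timezone

# Hypothetical landing location (placeholder) -- e.g. an ADLS/S3 mount or UC volume
landing_path = "/Volumes/raw/my_source/landing"

# Assuming response.text contains the JSON payload returned by the API call
payload = response.text

# One file per extraction run; a timestamp + uuid keeps file names unique so
# AutoLoader treats every extract as a new file
file_name = f"extract_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}_{uuid.uuid4().hex}.json"
dbutils.fs.put(f"{landing_path}/{file_name}", payload, overwrite=False)

From there, AutoLoader (or anything else) can pick the files up incrementally, and the raw payloads remain available for replay.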

 


8 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

One reason to use a landing path is that you will have an easier job when you need to reload a table. In that case you don't have to re-pull all the data from the APIs again.

Another thing is separation of concerns: extraction is handled by one part of the pipeline and loading by another, so you have clearly defined, logical steps.

But if you feel that in your case it's much simpler to write directly to a Delta table, then go for it.

AutoLoader is great for processing large batches of files. It allows you to land raw API responses as files (e.g., JSON), then automatically and incrementally ingest them into Delta tables. It supports scalable file discovery, handles schema drift, and enables efficient deduplication and replay. It's based on Spark Structured Streaming, so in case of failure you can recover gracefully.
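
To make that concrete, a minimal sketch of the landing-path pattern with Auto Loader might look like this (the paths and table name are placeholders, not anything from this thread):

# Placeholder paths and table name for illustration
landing_path = "/Volumes/raw/my_source/landing"
checkpoint_path = "/Volumes/raw/my_source/_checkpoints/bronze_my_source"

raw_stream = (
    spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference/evolution state
    .load(landing_path)
)

(
    raw_stream.writeStream
    .option("checkpointLocation", checkpoint_path)         # tracks processed files for recovery
    .toTable("bronze.my_source_raw")
)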

Hi @szymon_dybczak, you bring up some great points. What I'd like to narrow in on a bit more is the "data duplication" side of the conversation. In the past, team members have raised concerns about data duplication and the added complexity of having "extra steps" in our data ingestion process, so I want to make sure that we properly address these points when deciding whether or not to use AutoLoader.

In your experience, how much of a cost impact would this data duplication have? And is the added complexity (extra steps) really worth it? As a counter-example, the code I shared would effectively load API data directly into a raw Delta table and remove the need to land the data in its original .json files. The only downside I see to this approach is that we don't have separation of concerns (a valid point), but if we chose to do it this way, I think I see basically the same benefits as with leveraging AutoLoader.

Please let me know if I missed anything!

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

In your case, it's perfectly fine to load data directly into Delta tables (and skip any “extra” steps). However, let me share a few examples of how using Auto Loader has made our lives significantly easier on various projects.


For instance, in one project, IoT devices were sending data to a Data Lake. Imagine millions of files landing on your storage account. Auto Loader automatically detects which new files have arrived since the last load (and gives you out-of-the-box fault tolerance and exactly-once semantics).


Now, sometimes due to issues with the IoT devices, we would receive duplicate data.
That's why we made sure the process loading data into the Silver layer was idempotent (using a MERGE statement).
One more thing - check whether your API has any data retention limitations. For one client, we were integrating data from Google Search Console. It turned out that Google Search Console only retains data for the last 16 months. So in that case, we had to land the data first in a landing path in order to build a history longer than 16 months. Otherwise, if we ever needed to reprocess the data, we would lose historical records (since anything older than 16 months would no longer be available via the API).
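
For illustration, an idempotent Silver-layer load along those lines might look roughly like this (the table and key column names are made up for the example):

from delta.tables import DeltaTable

# Hypothetical bronze source and silver target; event_id is a placeholder business key
silver = DeltaTable.forName(spark, "silver.my_source")
updates = spark.table("bronze.my_source_raw").dropDuplicates(["event_id"])

(
    silver.alias("t")
    .merge(updates.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()      # refresh rows that already exist
    .whenNotMatchedInsertAll()   # insert genuinely new rows
    .execute()
)

Because re-running the MERGE with the same input produces the same result, replaying duplicate or late-arriving files doesn't corrupt the Silver table.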

All great points! A few more follow-up questions.

I can see the benefits of using AutoLoader for large-volume, real-time or near-real-time streaming data (readStream/writeStream under the hood, after all). But I'm wondering, is this "overkill" for data ingestion that is more batch-oriented?

For example, while we do have some real-time or near-real-time use cases, the majority of our use cases relate to either event or time-series (10-min/15-min/hourly) data that we are typically okay pulling daily. While "streaming" the data is not strictly necessary, I see that I can effectively run it as a batch via the trigger option on writeStream:

.trigger(availableNow=True)
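
Expanding that into a full write (reusing the placeholder names from the sketch earlier in the thread), the batch-style run would look something like:

(
    raw_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)   # process everything new since the last run, then stop
    .toTable("bronze.my_source_raw")
)

With availableNow the stream shuts itself down once it has caught up, so it can run on a daily schedule rather than a long-lived cluster.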

Also, for some additional context, we are currently running with all-purpose compute clusters in dev/prod. Does AutoLoader have suggested/preferred compute types to run on in order to save on (or avoid overspending on) compute resources?

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

Great question. In my case, we use Autoloader exactly as you described - let's call it "batch mode" using .trigger(availableNow=True). Our pipeline runs once a day and uses Autoloader to load new data into the bronze layer.
Regarding compute recommendations: Databricks always recommends using job compute clusters. However, I don’t see anything wrong with using an all-purpose cluster (especially if you plan to use the ‘available now’ option).

But nonetheless, the official recommendation is to use job compute.

Production considerations for Structured Streaming - Azure Databricks | Microsoft Learn

PS. If the answer was helpful to you, consider marking it as the accepted answer. This way, we help others find the solution to their problem more quickly.

Szymon, would serverless clusters provide a better option than either all-purpose or job clusters? My understanding is that Databricks focuses more on their serverless offerings these days.

Hi szymon! 

In a case where I have a file that is overwritten each month, is it advisable to use AutoLoader?

In my case, I have incremental loads per period ... 202501, 202502, 202503 ... If I overwrite the current month each time, should I use AutoLoader?

MartinIsti
New Contributor III

I am in a position where I have more and more use cases for AutoLoader. This might be an obvious question, but if I understand correctly, Christian's use case without the landing stage could still be an AutoLoader scenario if the extracts from the source system always land in a dedicated storage area (be that ADLS or S3 or ...). That way you still don't have your own version of the files (with its pros and cons), but you can utilise AutoLoader's directory tracking ability.

My current requirement is to ingest data from an external S3 bucket, and I'm similarly trying to evaluate the various approaches.
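
For what it's worth, a rough sketch of pointing Auto Loader at an external S3 location could look like this (bucket paths and the table name are placeholders; incremental directory listing is the default detection mode, and file notifications are an optional setting for very high file volumes):

# Placeholder S3 paths and table name for illustration
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/source_a")
    # default is incremental directory listing; for very large buckets, file
    # notifications can be enabled with .option("cloudFiles.useNotifications", "true")
    .load("s3://my-bucket/landing/source_a/")
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/source_a")
    .trigger(availableNow=True)
    .toTable("bronze.source_a_raw")
)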