
AutoLoader Pros/Cons When Extracting Data (Cross-Post)

ChristianRRL
Valued Contributor III

Cross-posting from: https://community.databricks.com/t5/data-engineering/autoloader-pros-cons-when-extracting-data/td-p/...

Hi there, I am interested in using AutoLoader, but I'd like to get a bit of clarity on whether it makes sense in my case. Based on examples I've seen, an ideal use case for AutoLoader is when we have some kind of landing path where we expect raw files to arrive (csv/json/xml/etc.), and AutoLoader can effectively scan for new files only and then append those to a raw table.

In the case where we need to extract the data from APIs ourselves (i.e. it's not yet available as raw files), would there be any point/reason in "landing" the data first before using AutoLoader to load it into the respective raw tables? Why not just load the data directly into raw Delta tables at the time of extraction?

I can easily see data duplication being a con and a potential reason to skip the landing step altogether, but are there any benefits I might be missing to landing the raw data first before loading it into the raw tables? I would greatly appreciate some feedback on this!

Below is sample code highlighting this potential option to skip the landing step and AutoLoader:

 

import json

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveToDelta").getOrCreate()

# Assuming response.text contains the JSON payload returned by the API call
data = json.loads(response.text)
df = pd.DataFrame(data)

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

# Write directly to a Delta table, skipping the landing step entirely
spark_df.write.format("delta").mode("overwrite").saveAsTable("your_delta_table_name")
# Or to a specific path:
# spark_df.write.format("delta").mode("overwrite").save("/path/to/your/delta_table")
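
For comparison, here's a rough sketch of the landing + AutoLoader approach described above, where AutoLoader picks up only newly landed files. The paths, options, and table name below are placeholder assumptions, not working values:

# Minimal AutoLoader sketch: incrementally read new files from an assumed landing
# path and append them to a raw Delta table (all paths/names are placeholders)
landing_path = "/Volumes/raw/api_source/landing"
checkpoint_path = "/Volumes/raw/api_source/_checkpoint"

raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # enables schema inference/evolution
    .load(landing_path)
)

(
    raw_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process only what's new, then stop (batch-style)
    .toTable("your_raw_table_name")
)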

 

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor II

You’ve already identified data duplication as a potential con of landing the data first, but there are several benefits to this approach that might not be immediately obvious:

  1. Schema Inference and Evolution: AutoLoader can automatically infer the schema of your data and adapt to changes over time (e.g., new fields in the API response). This reduces manual effort and makes it easier to handle evolving data structures.
  2. Incremental Loading: AutoLoader processes only new files, improving performance and reducing compute costs compared to reprocessing all data. This is especially useful if your API extractions happen frequently or in batches.
  3. Reprocessing Capability: With raw files in the landing zone, you can reprocess historical data for backfills, error corrections, or new transformations without calling the API again. If the data is later updated in the source system, you lose that history (should you need it again); that wouldn't happen with a file preserved in storage. You could argue Delta Lake's time travel covers some of this, though. A rough sketch of this landing step is shown below.
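
To make point 3 concrete, here's a hypothetical sketch of the landing step itself: persist the raw API payload to storage untouched so it can be replayed later without another API call. The endpoint, path, and file naming are assumptions, not from the original post:

import datetime
import requests

# Placeholder endpoint and landing location - adjust to your environment
response = requests.get("https://example.com/api/data")
landing_path = "/Volumes/raw/api_source/landing"
file_name = f"{landing_path}/extract_{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.json"

# dbutils is available in Databricks notebooks; the raw payload is stored as-is,
# ready for AutoLoader (or a later backfill/reprocess) to pick up
dbutils.fs.put(file_name, response.text, True)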

I love the question, @ChristianRRL. Looking forward to seeing other responses.

All the best,
BS

