
AutoLoader Pros/Cons When Extracting Data (Cross-Post)

ChristianRRL
Valued Contributor III

Cross-posting from: https://community.databricks.com/t5/data-engineering/autoloader-pros-cons-when-extracting-data/td-p/...

Hi there, I am interested in using AutoLoader, but I'd like to get a bit of clarity on whether it makes sense in my case. Based on examples I've seen, an ideal use case for AutoLoader is when we have some kind of landing path where we expect raw files to arrive (csv/json/xml/etc.), and AutoLoader can effectively scan for new files only and then append those to a raw table.

In the case where we need to extract the data from APIs ourselves (i.e. it's not yet available as raw files), would there be any point/reason in "landing" the data first before using AutoLoader to load it into the respective raw tables? Why not just load the data directly into raw Delta tables at the time of extraction?

I can easily see data duplication being a con and a potential reason to skip the landing step altogether, but are there any benefits I might be missing to landing the raw data first before loading it into the raw tables? I would greatly appreciate some feedback on this!

Below is sample code highlighting this potential option to skip the landing step and AutoLoader:

 

import json

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveToDelta").getOrCreate()

# Assuming response.text contains the JSON payload returned by the API call
data = json.loads(response.text)
df = pd.DataFrame(data)

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

# Write directly to a Delta table, skipping the landing step entirely
spark_df.write.format("delta").mode("overwrite").saveAsTable("your_delta_table_name")
# Or to a specific path:
# spark_df.write.format("delta").mode("overwrite").save("/path/to/your/delta_table")
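
For comparison, here's a rough sketch of the landing + AutoLoader approach described above, where AutoLoader picks up only newly landed files. The paths, options, and table name below are placeholder assumptions, not working values:

# Minimal AutoLoader sketch: incrementally read new files from an assumed landing
# path and append them to a raw Delta table (all paths/names are placeholders)
landing_path = "/Volumes/raw/api_source/landing"
checkpoint_path = "/Volumes/raw/api_source/_checkpoint"

raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # enables schema inference/evolution
    .load(landing_path)
)

(
    raw_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process only what's new, then stop (batch-style)
    .toTable("your_raw_table_name")
)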

 

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor II

You’ve already identified data duplication as a potential con of landing the data first, but there are several benefits to this approach that might not be immediately obvious:

  1. Schema Inference and Evolution: AutoLoader can automatically infer the schema of your data and adapt to changes over time (e.g., new fields in the API response). This reduces manual effort and makes it easier to handle evolving data structures.
  2. Incremental Loading: AutoLoader processes only new files, improving performance and reducing compute costs compared to reprocessing all data. This is especially useful if your API extractions happen frequently or in batches.
  3. Reprocessing Capability: With raw files in the landing zone, you can reprocess historical data for backfills, error corrections, or new transformations without calling the API again. If the data is later updated in the source system, you lose that history (should you need it again); that wouldn't happen with a file preserved in storage. You could argue Delta Lake's time travel covers some of this, though. A rough sketch of this landing step is shown below.
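
To make point 3 concrete, here's a hypothetical sketch of the landing step itself: persist the raw API payload to storage untouched so it can be replayed later without another API call. The endpoint, path, and file naming are assumptions, not from the original post:

import datetime
import requests

# Placeholder endpoint and landing location - adjust to your environment
response = requests.get("https://example.com/api/data")
landing_path = "/Volumes/raw/api_source/landing"
file_name = f"{landing_path}/extract_{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.json"

# dbutils is available in Databricks notebooks; the raw payload is stored as-is,
# ready for AutoLoader (or a later backfill/reprocess) to pick up
dbutils.fs.put(file_name, response.text, True)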

I love the question, @ChristianRRL. Looking forward to seeing other responses.

All the best,
BS

