Cross-posting from: https://community.databricks.com/t5/data-engineering/autoloader-pros-cons-when-extracting-data/td-p/...
Hi there, I'm interested in using AutoLoader, but I'd like some clarity on whether it makes sense in my case. Based on the examples I've seen, the ideal use case for AutoLoader is a landing path where raw files (CSV/JSON/XML/etc.) are expected to arrive; AutoLoader can then incrementally detect only the new files and append them to a raw table.
In cases where we have to extract the data from APIs ourselves (i.e. it does not already arrive as raw files), is there any point/reason in "landing" the data first before using AutoLoader to load it into the respective raw tables? Why not write the data directly into raw Delta tables at extraction time?
I can easily see data duplication being a con and a potential reason to skip the landing step altogether, but are there any benefits I might be missing to landing the raw data before loading it into the raw tables? I'd greatly appreciate some feedback on this!
Below is sample code illustrating the option of skipping landing + AutoLoader:
import json
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveToDelta").getOrCreate()

# Assuming `response` holds an API response and response.text contains JSON data
data = json.loads(response.text)
df = pd.DataFrame(data)

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

# Append to the raw Delta table ("overwrite" would replace prior extractions,
# which we don't want for an incremental raw load)
spark_df.write.format("delta").mode("append").saveAsTable("your_delta_table_name")
# Or to a specific path:
# spark_df.write.format("delta").mode("append").save("/path/to/your/delta_table")
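For contrast, a minimal sketch of the landing-first alternative I'm asking about, with a hypothetical landing directory and payload (on Databricks the landing path would normally be cloud/volume storage rather than a local path):

```python
import json
import time
from pathlib import Path

# Hypothetical landing directory; on Databricks this would typically be
# cloud storage (e.g. a /Volumes/... or abfss://... path)
landing_dir = Path("/tmp/landing/my_api")
landing_dir.mkdir(parents=True, exist_ok=True)

# Stand-in for the parsed API response
payload = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# Land each extraction as a new, immutable file; AutoLoader only reacts to
# new files appearing, so a timestamped name preserves extraction history
out_file = landing_dir / f"extract_{int(time.time())}.json"
out_file.write_text(json.dumps(payload))

# AutoLoader would then pick these files up incrementally, e.g.:
# (spark.readStream.format("cloudFiles")
#       .option("cloudFiles.format", "json")
#       .load(str(landing_dir)))
```

The landed files double as a replayable source of truth, which seems to be the main trade-off against the duplication cost.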