
AutoLoader Ingestion Best Practice

ChristianRRL
Valued Contributor III

Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend loading the latest data in its rawest form into a raw Delta table (i.e. with its highly nested, JSON-like schema), and then performing the appropriate flattening/upserting from that table into the "actual" target tables. But it could be argued that this overcomplicates things, and that it's better to simply build a DataFrame with the latest unprocessed data (e.g. via AutoLoader or spark.read), apply the relevant transformations, and load the result directly into the proper target tables.

Are there strong reasons to pick one approach over the other?

AutoLoader example:

schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'

# Incrementally pick up new JSON files from the landing path with Auto Loader
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # non-matching data is captured in _rescued_data
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")  # keep source file metadata alongside the payload
)

###########################################################
# do or do not transform dataframe (i.e. flatten json data)
###########################################################

# Append the (optionally transformed) stream to the raw Delta location
query = (df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)  # process everything pending, then stop
    .option("checkpointLocation", f"{raw_path}/checkpoint")
    .start(f"{raw_path}/data")
)

query.awaitTermination()
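
For comparison, here is a rough sketch of the second approach: transform in flight and skip the dedicated raw table. It assumes the same df as above; target_path, the flattened column names, and exploding the elementData.element.data map from schema_hints are illustrative assumptions on my part, not something prescribed by Auto Loader:

from pyspark.sql import functions as F

# Illustrative flattening: explode the MAP declared in schema_hints into rows
flat_df = (df
    .select(
        F.col("_metadata.file_path").alias("source_file"),
        F.explode("elementData.element.data").alias("element_key", "element_value"),
    )
    .select(
        "source_file",
        "element_key",
        F.col("element_value.dataPoint").alias("dataPoint"),
        F.col("element_value.values").alias("values"),
    )
)

# Write the flattened stream straight to the target location (hypothetical path)
flat_query = (flat_df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{target_path}/checkpoint")
    .start(f"{target_path}/data")
)

flat_query.awaitTermination()

With this variant there is only one hop, but the original nested payload is not persisted anywhere, so any change to the flattening logic can only be applied to files that have not yet been ingested.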

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor II

I think the key thing with holding the raw data in a table, and not transforming that table, is that you have more flexibility at your disposal. There's a great resource on Lakehouse best practices in the Databricks docs. I'd highly recommend checking it out, and in particular this section: https://docs.databricks.com/aws/en/lakehouse-architecture/reliability/best-practices#2-manage-data-q...


@ChristianRRL there's no one-size-fits-all, though; it'll depend on your use case. If it's a simple use case and you won't require as much flexibility, the simpler option will suffice, of course. No need to overengineer every problem 👍.

All the best,
BS
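
To make the flexibility point concrete, here is a minimal sketch (an illustration added to this write-up, not part of the original reply) of how the raw Delta table written in the question's example could feed an incremental flatten/upsert into a downstream table. The target table name, the join key, and the flattened columns are hypothetical assumptions:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the raw Delta table incrementally; only new raw rows are processed per run
raw_stream = spark.readStream.format("delta").load(f"{raw_path}/data")

# Illustrative flattening of the nested map declared in schema_hints
flattened = (raw_stream
    .select(F.explode("elementData.element.data").alias("element_key", "element_value"))
    .select(
        "element_key",
        F.col("element_value.dataPoint").alias("dataPoint"),
        F.col("element_value.values").alias("values"),
    )
)

def upsert_to_target(batch_df, batch_id):
    # target table and join key are hypothetical; adapt to the real business key
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.element_key = s.element_key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(flattened.writeStream
    .foreachBatch(upsert_to_target)
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{raw_path}/target_checkpoint")
    .start()
    .awaitTermination())

Because the raw table keeps the full nested history, the same pattern also allows a complete rebuild of the target if the flattening logic ever changes.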

