AutoLoader Ingestion Best Practice

ChristianRRL — Tue, 07 Oct 2025 20:54:31 GMT

Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend that the latest data should be loaded in its rawest form into a raw delta table (i.e. highly nested json-like schema) and from that data the appropriate flattening/upserting into the "actual" target tables should be performed. But it may be argued that this is overcomplicating things and it may be better to simply generate a dataframe with the latest non-processed data (i.e. via AutoLoader or read.spark), apply the relevant transformations, and then load into the proper target tables.

Are there strong reasons to pick one approach over the other?

AutoLoader example:

schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>' df = (spark.readStream .format("cloudFiles").option("cloudFiles.format", "json") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaHints", schema_hints) .option("multiLine", "true") .option("cloudFiles.schemaLocation", f"{raw_path}/schema") .option("cloudFiles.schemaEvolutionMode", "rescue") .load(f"{landing_path}/*/data") .select("*", "_metadata") ) ########################################################### # do or do not transform dataframe (i.e. flatten json data) ########################################################### query = (df.writeStream .outputMode("append") .trigger(availableNow=True) .option("checkpointLocation", f"{raw_path}/checkpoint") .start(f"{raw_path}/data") ) query.awaitTermination()

Re: AutoLoader Ingestion Best Practice

BS_THE_ANALYST — Wed, 08 Oct 2025 06:02:41 GMT

I think the key thing with holding the raw data in a table, and not transforming that table, is that you have more flexibility at your disposal. There's a great resource available via Databricks Docs for best practices in the Lakehouse. I'd highly recommend checking it out, and in particular, this section: https://docs.databricks.com/aws/en/lakehouse-architecture/reliability/best-practices#2-manage-data-quality

@ChristianRRL there is no one size fits all though, it'll depend on your use case. If is a simple use case, and you won't require as much flexibility, the simple option will suffice, of course. No need to overengineer every problem 👍.

All the best,
BS

topic AutoLoader Ingestion Best Practice in Get Started Discussions

AutoLoader Ingestion Best Practice

Re: AutoLoader Ingestion Best Practice