Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend landing the latest data in its rawest form in a raw Delta table (i.e. keeping the highly nested JSON-like schema) and then doing the flattening/upserting from that table into the "actual" target tables. On the other hand, it could be argued that this overcomplicates things and that it's better to simply build a dataframe with the latest unprocessed data (e.g. via AutoLoader or spark.read), apply the relevant transformations, and load the result straight into the target tables.
Are there strong reasons to pick one approach over the other?
AutoLoader example:
schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")             # infer types instead of reading everything as string
    .option("cloudFiles.schemaHints", schema_hints)            # pin the types of the nested columns
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema") # where AutoLoader tracks the inferred schema
    .option("cloudFiles.schemaEvolutionMode", "rescue")        # unexpected columns go to _rescued_data instead of failing the stream
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")                                  # keep source-file metadata for lineage
)
###########################################################
# do or do not transform dataframe (i.e. flatten json data)
###########################################################
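# If you choose to transform before writing, a flattening step based on the
# schema hint above could look roughly like this. This is only a sketch: the
# names flat_df, element_id and element are illustrative, and the exact
# selects depend on your actual data.
from pyspark.sql import functions as F

flat_df = (df
    # one row per entry in the elementData.element.data map
    .select("_metadata", F.explode("elementData.element.data").alias("element_id", "element"))
    # one row per entry in the values array, keeping the dataPoint map alongside
    .select("element_id", "element.dataPoint", F.explode("element.values").alias("value"), "_metadata")
)
# If you flatten here, write flat_df below instead of df.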
query = (df.writeStream
    .format("delta")                                           # write the raw stream as a Delta table
    .outputMode("append")                                      # append-only raw ingest
    .trigger(availableNow=True)                                # process all available files, then stop
    .option("checkpointLocation", f"{raw_path}/checkpoint")
    .start(f"{raw_path}/data")
)
query.awaitTermination()
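For what it's worth, if you go with the "raw table first" approach, the second stage would be a separate job that reads the raw Delta table, flattens it, and upserts into the target. A minimal sketch, assuming a target table named target.elements keyed on element_id (both names are hypothetical, and the flattening columns are only illustrative):

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# read the raw Delta table written by the stream above
raw_df = spark.read.format("delta").load(f"{raw_path}/data")

# flatten the nested map into one row per element
flat_df = (raw_df
    .select(F.explode("elementData.element.data").alias("element_id", "element"))
    .select("element_id", "element.dataPoint")
)

# upsert into the target table on the business key
(DeltaTable.forName(spark, "target.elements").alias("t")
    .merge(flat_df.alias("s"), "t.element_id = s.element_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

In practice you would probably run this incrementally (e.g. a second stream over the raw table with foreachBatch performing the MERGE) rather than re-reading the whole raw table on every run.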