
AutoLoader Ingestion Best Practice

ChristianRRL
Valued Contributor III

Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend loading the latest data in its rawest form into a raw Delta table (i.e. with its highly nested, JSON-like schema), and then performing the appropriate flattening/upserting from that table into the "actual" target tables. But it could be argued that this overcomplicates things, and that it's better to simply build a DataFrame with the latest unprocessed data (e.g. via AutoLoader or spark.read), apply the relevant transformations, and load the result directly into the proper target tables.

Are there strong reasons to pick one approach over the other?

AutoLoader example:

schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'

# Incrementally pick up new JSON files from the landing path with Auto Loader
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # non-matching data is captured in _rescued_data
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")  # keep source file metadata alongside the payload
)

###########################################################
# do or do not transform dataframe (i.e. flatten json data)
###########################################################

# Append the (optionally transformed) stream to the raw Delta location
query = (df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)  # process everything pending, then stop
    .option("checkpointLocation", f"{raw_path}/checkpoint")
    .start(f"{raw_path}/data")
)

query.awaitTermination()
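
For comparison, here is a rough sketch of the second approach: transform in flight and skip the dedicated raw table. It assumes the same df as above; target_path, the flattened column names, and exploding the elementData.element.data map from schema_hints are illustrative assumptions on my part, not something prescribed by Auto Loader:

from pyspark.sql import functions as F

# Illustrative flattening: explode the MAP declared in schema_hints into rows
flat_df = (df
    .select(
        F.col("_metadata.file_path").alias("source_file"),
        F.explode("elementData.element.data").alias("element_key", "element_value"),
    )
    .select(
        "source_file",
        "element_key",
        F.col("element_value.dataPoint").alias("dataPoint"),
        F.col("element_value.values").alias("values"),
    )
)

# Write the flattened stream straight to the target location (hypothetical path)
flat_query = (flat_df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{target_path}/checkpoint")
    .start(f"{target_path}/data")
)

flat_query.awaitTermination()

With this variant there is only one hop, but the original nested payload is not persisted anywhere, so any change to the flattening logic can only be applied to files that have not yet been ingested.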

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor II

I think the key thing with holding the raw data in a table, and not transforming that table, is that you have more flexibility at your disposal. There's a great resource on Lakehouse best practices in the Databricks docs. I'd highly recommend checking it out, and in particular this section: https://docs.databricks.com/aws/en/lakehouse-architecture/reliability/best-practices#2-manage-data-q...


@ChristianRRL there's no one-size-fits-all, though; it'll depend on your use case. If it's a simple use case and you won't require as much flexibility, the simpler option will suffice, of course. No need to overengineer every problem 👍.

All the best,
BS
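
To make the flexibility point concrete, here is a minimal sketch (an illustration added to this write-up, not part of the original reply) of how the raw Delta table written in the question's example could feed an incremental flatten/upsert into a downstream table. The target table name, the join key, and the flattened columns are hypothetical assumptions:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the raw Delta table incrementally; only new raw rows are processed per run
raw_stream = spark.readStream.format("delta").load(f"{raw_path}/data")

# Illustrative flattening of the nested map declared in schema_hints
flattened = (raw_stream
    .select(F.explode("elementData.element.data").alias("element_key", "element_value"))
    .select(
        "element_key",
        F.col("element_value.dataPoint").alias("dataPoint"),
        F.col("element_value.values").alias("values"),
    )
)

def upsert_to_target(batch_df, batch_id):
    # target table and join key are hypothetical; adapt to the real business key
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.element_key = s.element_key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(flattened.writeStream
    .foreachBatch(upsert_to_target)
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{raw_path}/target_checkpoint")
    .start()
    .awaitTermination())

Because the raw table keeps the full nested history, the same pattern also allows a complete rebuild of the target if the flattening logic ever changes.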

