<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AutoLoader Ingestion Best Practice in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/autoloader-ingestion-best-practice/m-p/134118#M10814</link>
    <description>&lt;P&gt;Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend loading the latest data in its rawest form into a raw Delta table (i.e. a highly nested, JSON-like schema) and then performing the appropriate flattening/upserting from that table into the "actual" target tables. But it may be argued that this overcomplicates things, and that it may be better to simply generate a dataframe with the latest unprocessed data (e.g. via AutoLoader or spark.read), apply the relevant transformations, and then load into the proper target tables.&lt;/P&gt;&lt;P&gt;Are there strong reasons to pick one approach over the other?&lt;/P&gt;&lt;P&gt;AutoLoader example:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;schema_hints = 'elementData.element.data MAP&amp;lt;STRING, STRUCT&amp;lt;dataPoint: MAP&amp;lt;STRING, STRING&amp;gt;, values: ARRAY&amp;lt;MAP&amp;lt;STRING, STRING&amp;gt;&amp;gt;&amp;gt;&amp;gt;'

df = (spark.readStream
    .format("cloudFiles").option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")
)

###########################################################
# do or do not transform dataframe (e.g. flatten JSON data)
###########################################################

query = (df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{raw_path}/checkpoint")
    .start(f"{raw_path}/data")
)

query.awaitTermination()&lt;/LI-CODE&gt;</description>
    <pubDate>Tue, 07 Oct 2025 20:54:31 GMT</pubDate>
    <dc:creator>ChristianRRL</dc:creator>
    <dc:date>2025-10-07T20:54:31Z</dc:date>
    <item>
      <title>AutoLoader Ingestion Best Practice</title>
      <link>https://community.databricks.com/t5/get-started-discussions/autoloader-ingestion-best-practice/m-p/134118#M10814</link>
      <description>&lt;P&gt;Hi there, I would appreciate some input on AutoLoader best practice. I've read that some people recommend loading the latest data in its rawest form into a raw Delta table (i.e. a highly nested, JSON-like schema) and then performing the appropriate flattening/upserting from that table into the "actual" target tables. But it may be argued that this overcomplicates things, and that it may be better to simply generate a dataframe with the latest unprocessed data (e.g. via AutoLoader or spark.read), apply the relevant transformations, and then load into the proper target tables.&lt;/P&gt;&lt;P&gt;Are there strong reasons to pick one approach over the other?&lt;/P&gt;&lt;P&gt;AutoLoader example:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;schema_hints = 'elementData.element.data MAP&amp;lt;STRING, STRUCT&amp;lt;dataPoint: MAP&amp;lt;STRING, STRING&amp;gt;, values: ARRAY&amp;lt;MAP&amp;lt;STRING, STRING&amp;gt;&amp;gt;&amp;gt;&amp;gt;'

df = (spark.readStream
    .format("cloudFiles").option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")
)

###########################################################
# do or do not transform dataframe (e.g. flatten JSON data)
###########################################################

query = (df.writeStream
    .outputMode("append")
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{raw_path}/checkpoint")
    .start(f"{raw_path}/data")
)

query.awaitTermination()&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 07 Oct 2025 20:54:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/autoloader-ingestion-best-practice/m-p/134118#M10814</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2025-10-07T20:54:31Z</dc:date>
    </item>
    <item>
      <title>Re: AutoLoader Ingestion Best Practice</title>
      <link>https://community.databricks.com/t5/get-started-discussions/autoloader-ingestion-best-practice/m-p/134148#M10820</link>
      <description>&lt;P&gt;I think the key benefit of holding the raw data in a table, and not transforming that table, is that you have more flexibility at your disposal, for example to reprocess the data later if your transformation logic changes. There's a great resource in the Databricks Docs on best practices for the Lakehouse. I'd highly recommend checking it out, in particular this section:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/lakehouse-architecture/reliability/best-practices#2-manage-data-quality" target="_blank"&gt;https://docs.databricks.com/aws/en/lakehouse-architecture/reliability/best-practices#2-manage-data-quality&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_1-1759903199355.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20546i3D44AC9D6A016E6A/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_1-1759903199355.png" alt="BS_THE_ANALYST_1-1759903199355.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/96188"&gt;@ChristianRRL&lt;/a&gt;&amp;nbsp;there is no one-size-fits-all answer though; it'll depend on your use case. If it's a simple use case and you won't need as much flexibility, the simple option will suffice, of course. No need to overengineer every problem &lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Wed, 08 Oct 2025 06:02:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/autoloader-ingestion-best-practice/m-p/134148#M10820</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-10-08T06:02:41Z</dc:date>
    </item>
  </channel>
</rss>

