<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: schema check in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/schema-check/m-p/149873#M53191</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/218531"&gt;@neerajaN&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;You are right: Job 5 is the schema inference job. You can tell because it triggers immediately upon &lt;CODE data-index-in-node="94" data-path-to-node="13,0"&gt;spark.read&lt;/CODE&gt;. Since &lt;CODE data-index-in-node="112" data-path-to-node="13,0"&gt;header=True&lt;/CODE&gt; is set without a manual &lt;CODE data-index-in-node="148" data-path-to-node="13,0"&gt;.schema()&lt;/CODE&gt;, Spark must launch a job to read the file headers before it can define the DataFrame. The subsequent Job 6 is the actual &lt;CODE data-index-in-node="284" data-path-to-node="13,0"&gt;count()&lt;/CODE&gt; action that processes the full data.&lt;/P&gt;
&lt;P&gt;To verify, check the SQL tab in the Spark UI for Job 5: you should see a &lt;CODE data-index-in-node="101" data-path-to-node="10,0"&gt;FileScan&lt;/CODE&gt; operation that finishes almost instantly, because it reads only enough of each file to resolve the column names.&lt;/P&gt;
&lt;P&gt;To your question about who does the work: the driver coordinates, but the executors perform the actual read. Spark launches a small job in which executors read the first few bytes/rows of the files to determine the schema. Once the schema is known, your action &lt;CODE data-index-in-node="57" data-path-to-node="6,1,0"&gt;df.count()&lt;/CODE&gt; triggers the distributed processing of the entire dataset.&lt;/P&gt;
&lt;P&gt;It's the same concept I explained in response to one of your other &lt;A href="https://community.databricks.com/t5/data-engineering/count-function/m-p/149761#M53175" target="_self"&gt;posts&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;To eliminate Job 5 and speed up your pipeline, provide a manual schema. Spark then remains fully lazy, deferring all execution until the count is called.&lt;/P&gt;
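A minimal sketch of that in PySpark (the file path and column names below are placeholders, not taken from your job):

```python
# DDL-style schema string -- the column names here are placeholders.
manual_schema = "id INT, name STRING, amount DOUBLE"

def read_with_manual_schema(spark, path):
    # With an explicit schema, spark.read stays fully lazy: no extra
    # job (like your Job 5) runs to read headers or infer types.
    return (
        spark.read
             .option("header", True)   # header row is skipped, not used for inference
             .schema(manual_schema)    # .schema() accepts a DDL string or a StructType
             .csv(path)
    )
```

Only a later action (for example df.count()) launches a job against the data.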
&lt;P&gt;&lt;STRONG&gt;References:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;Databricks Documentation: &lt;/SPAN&gt;&lt;A style="font-size: small; background-color: #ffffff;" href="https://www.google.com/search?q=https://docs.databricks.com/en/tables/index.html%23provide-a-schema" rel="noopener" target="_blank"&gt;Optimization: Providing a Schema&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI data-path-to-node="12,0,0"&gt;&lt;FONT size="2"&gt;Apache Spark Guide: &lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html" rel="noopener" target="_blank"&gt;&lt;FONT size="2"&gt;Spark SQL Data Sources - CSV&lt;/FONT&gt;&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-path-to-node="12,1,0"&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;Note: Look for the section on &lt;/SPAN&gt;&lt;SPAN&gt;samplingRatio&lt;/SPAN&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;SPAN&gt;inferSchema&lt;/SPAN&gt;&lt;SPAN&gt; which explains the performance trade-offs of schema discovery.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P data-path-to-node="12,1,0"&gt;&lt;FONT color="#FF6600"&gt;&lt;STRONG&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
</description>
    <pubDate>Thu, 05 Mar 2026 06:59:17 GMT</pubDate>
    <dc:creator>Ashwin_DSA</dc:creator>
    <dc:date>2026-03-05T06:59:17Z</dc:date>
    <item>
      <title>schema check</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-check/m-p/149869#M53190</link>
      <description>&lt;P&gt;hi , i am running the below query in databricks , first job5 created with 10 partitions .&lt;/P&gt;&lt;P&gt;and again job6 started where actual processing started.&lt;/P&gt;&lt;P&gt;in job5 is it identifying schema , when schema check will be done for the new dataset. is it checked by driver or any one of the executor before actual process starts.?&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="schema check.png" style="width: 860px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/24575i0B4FA11A8564C885/image-size/large?v=v2&amp;amp;px=999" role="button" title="schema check.png" alt="schema check.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 06:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-check/m-p/149869#M53190</guid>
      <dc:creator>neerajaN</dc:creator>
      <dc:date>2026-03-05T06:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: schema check</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-check/m-p/149873#M53191</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/218531"&gt;@neerajaN&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;You are right: Job 5 is the schema inference job. You can tell because it triggers immediately upon &lt;CODE data-index-in-node="94" data-path-to-node="13,0"&gt;spark.read&lt;/CODE&gt;. Since &lt;CODE data-index-in-node="112" data-path-to-node="13,0"&gt;header=True&lt;/CODE&gt; is set without a manual &lt;CODE data-index-in-node="148" data-path-to-node="13,0"&gt;.schema()&lt;/CODE&gt;, Spark must launch a job to read the file headers before it can define the DataFrame. The subsequent Job 6 is the actual &lt;CODE data-index-in-node="284" data-path-to-node="13,0"&gt;count()&lt;/CODE&gt; action that processes the full data.&lt;/P&gt;
&lt;P&gt;To verify, check the SQL tab in the Spark UI for Job 5: you should see a &lt;CODE data-index-in-node="101" data-path-to-node="10,0"&gt;FileScan&lt;/CODE&gt; operation that finishes almost instantly, because it reads only enough of each file to resolve the column names.&lt;/P&gt;
&lt;P&gt;To your question about who does the work: the driver coordinates, but the executors perform the actual read. Spark launches a small job in which executors read the first few bytes/rows of the files to determine the schema. Once the schema is known, your action &lt;CODE data-index-in-node="57" data-path-to-node="6,1,0"&gt;df.count()&lt;/CODE&gt; triggers the distributed processing of the entire dataset.&lt;/P&gt;
&lt;P&gt;It's the same concept I explained in response to one of your other &lt;A href="https://community.databricks.com/t5/data-engineering/count-function/m-p/149761#M53175" target="_self"&gt;posts&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;To eliminate Job 5 and speed up your pipeline, provide a manual schema. Spark then remains fully lazy, deferring all execution until the count is called.&lt;/P&gt;
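A minimal sketch of that in PySpark (the file path and column names below are placeholders, not taken from your job):

```python
# DDL-style schema string -- the column names here are placeholders.
manual_schema = "id INT, name STRING, amount DOUBLE"

def read_with_manual_schema(spark, path):
    # With an explicit schema, spark.read stays fully lazy: no extra
    # job (like your Job 5) runs to read headers or infer types.
    return (
        spark.read
             .option("header", True)   # header row is skipped, not used for inference
             .schema(manual_schema)    # .schema() accepts a DDL string or a StructType
             .csv(path)
    )
```

Only a later action (for example df.count()) launches a job against the data.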
&lt;P&gt;&lt;STRONG&gt;References:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;Databricks Documentation: &lt;/SPAN&gt;&lt;A style="font-size: small; background-color: #ffffff;" href="https://www.google.com/search?q=https://docs.databricks.com/en/tables/index.html%23provide-a-schema" rel="noopener" target="_blank"&gt;Optimization: Providing a Schema&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI data-path-to-node="12,0,0"&gt;&lt;FONT size="2"&gt;Apache Spark Guide: &lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html" rel="noopener" target="_blank"&gt;&lt;FONT size="2"&gt;Spark SQL Data Sources - CSV&lt;/FONT&gt;&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-path-to-node="12,1,0"&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;Note: Look for the section on &lt;/SPAN&gt;&lt;SPAN&gt;samplingRatio&lt;/SPAN&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;SPAN&gt;inferSchema&lt;/SPAN&gt;&lt;SPAN&gt; which explains the performance trade-offs of schema discovery.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P data-path-to-node="12,1,0"&gt;&lt;FONT color="#FF6600"&gt;&lt;STRONG&gt;&lt;FONT size="2"&gt;&lt;SPAN&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
</description>
      <pubDate>Thu, 05 Mar 2026 06:59:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-check/m-p/149873#M53191</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-03-05T06:59:17Z</dc:date>
    </item>
  </channel>
</rss>

