<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Delta Live Tables consuming different files from the same path are combining the schema - Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-consuming-different-files-from-the-same-path/m-p/54358#M30059</link>
    <description>Databricks Community (Data Engineering) topic: a Delta Live Tables pipeline reading CSV files with cloud_files() and a fileNamePattern option merges the schemas of all files in the same folder. The full question and the author's follow-up solution are in the items below.</description>
    <pubDate>Thu, 30 Nov 2023 16:06:02 GMT</pubDate>
    <dc:creator>Arnold_Souza</dc:creator>
    <dc:date>2023-11-30T16:06:02Z</dc:date>
    <item>
      <title>Delta Live Tables consuming different files from the same path are combining the schema</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-consuming-different-files-from-the-same-path/m-p/54358#M30059</link>
      <description>&lt;H3&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;I am using Delta Live Tables to build a pipeline in Databricks, and the schemas of different files placed in the same data lake folder are being merged, even though I am using a file name pattern to separate the ingestion.&lt;/P&gt;&lt;H3&gt;Details&lt;/H3&gt;&lt;P&gt;I have two files in the same path of a storage account, &lt;STRONG&gt;file_1&lt;/STRONG&gt;&amp;nbsp;and &lt;STRONG&gt;file_2&lt;/STRONG&gt;. They are placed in an ABFSS location like:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;abfss://&amp;lt;container&amp;gt;@&amp;lt;storage_account&amp;gt;.dfs.core.windows.net/path/to/folder/&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;file_1&lt;/STRONG&gt; has a schema like:&lt;/P&gt;&lt;P&gt;- column_a&lt;BR /&gt;- column_b&lt;BR /&gt;- column_c&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;file_2&lt;/STRONG&gt;&amp;nbsp;has a schema like:&lt;/P&gt;&lt;P&gt;- column_d&lt;BR /&gt;- column_e&lt;BR /&gt;- column_f&lt;/P&gt;&lt;P&gt;The code I am using to create the Delta Live Tables is the following:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;CREATE OR REFRESH STREAMING LIVE TABLE table_number_1 (
  CONSTRAINT correct_schema EXPECT (_rescued_data IS NULL)
)
PARTITIONED BY (_JOB_UPDATED_PARTITION_YEAR, _JOB_UPDATED_PARTITION_MONTH)
COMMENT "Comment for table 1"
TBLPROPERTIES (
  -- Quality flag
  'mypipeline.quality' = 'bronze',

  -- Delta Live Tables table properties
  'pipelines.autoOptimize.managed' = 'true',
  'pipelines.reset.allowed' = 'true',
  'pipelines.autoOptimize.zOrderCols' = '_JOB_UPDATED_PARTITION_YEAR, _JOB_UPDATED_PARTITION_MONTH',

  -- Delta table options
  'delta.enableDeletionVectors' = 'true',
  'delta.deletedFileRetentionDuration' = '30 days',
  'delta.enableChangeDataFeed' = 'true',
  'delta.feature.timestampNtz' = 'supported',

  -- Enables special characters in the table
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
AS
SELECT
  *,
  _metadata.file_path AS _JOB_SOURCE_FILE,
  _metadata.file_name AS _FILE_NAME,
  CONVERT_TIMEZONE('UTC', 'America/Belem', _metadata.file_modification_time) AS _MODIFICATION_TIME_BR,
  CONVERT_TIMEZONE('UTC', 'America/Belem', current_timestamp()) AS _JOB_UPDATED_TIME_BR,
  year(_JOB_UPDATED_TIME_BR) AS _JOB_UPDATED_PARTITION_YEAR,
  month(_JOB_UPDATED_TIME_BR) AS _JOB_UPDATED_PARTITION_MONTH,
  'Datalake' AS _DATA_ORIGIN
FROM cloud_files(
  "abfss://&amp;lt;container&amp;gt;@&amp;lt;storage_account&amp;gt;.dfs.core.windows.net/path/to/folder/",
  'csv',
  map(
    -- Generic options
    'fileNamePattern', '*file_1*',

    -- Common Auto Loader options
    'cloudFiles.allowOverwrites', 'true',
    'cloudFiles.inferColumnTypes', 'true',
    'cloudFiles.maxFilesPerTrigger', '1',

    -- CSV file format options
    'ignoreLeadingWhiteSpace', 'true',
    'ignoreTrailingWhiteSpace', 'true',
    'encoding', 'UTF-8',
    'header', 'true',
    'mode', 'FAILFAST',
    'multiLine', 'false',
    'readerCaseSensitive', 'false',
    'delimiter', ',',
    'skipRows', '0',

    -- Escape config
    'quote', '\"',
    'escape', '\"',
    'unescapedQuoteHandling', 'BACK_TO_DELIMITER'
  )
);&lt;/LI-CODE&gt;&lt;P&gt;In this code, I read from the ABFSS path using the &lt;STRONG&gt;cloud_files()&lt;/STRONG&gt; function [&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/sql-ref#--auto-loader-sql-syntax" target="_self"&gt;link&lt;/A&gt;], and because the folder contains different files I apply a &lt;STRONG&gt;fileNamePattern&lt;/STRONG&gt;&amp;nbsp;such as &lt;STRONG&gt;'*file_1*'&lt;/STRONG&gt;. According to the &lt;STRONG&gt;Auto Loader&lt;/STRONG&gt;&amp;nbsp;documentation [&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/options#generic-options" target="_self"&gt;link&lt;/A&gt;], this is a generic option for providing a pattern that selects files in a folder.&lt;/P&gt;&lt;P&gt;The problem is that when I start the DLT pipeline to create the two tables, both of them are materialized with the following schema:&lt;/P&gt;&lt;P&gt;- column_a&lt;BR /&gt;- column_b&lt;BR /&gt;- column_c&lt;BR /&gt;- column_d&lt;BR /&gt;- column_e&lt;BR /&gt;- column_f&lt;/P&gt;&lt;P&gt;That is, when inferring the schema, DLT (Delta Live Tables) somehow combines all the files in the folder instead of honoring the &lt;STRONG&gt;fileNamePattern&lt;/STRONG&gt;&amp;nbsp;provided inside the &lt;STRONG&gt;cloud_files()&lt;/STRONG&gt;&amp;nbsp;function.&lt;/P&gt;&lt;P&gt;That is strange, because if I read the same files outside DLT via the &lt;STRONG&gt;read_files()&lt;/STRONG&gt;&amp;nbsp;table-valued function [&lt;A href="https://docs.databricks.com/en/sql/language-manual/functions/read_files.html#read_files-table-valued-function" target="_self"&gt;link&lt;/A&gt;], the tables are created as expected.&lt;/P&gt;&lt;H3&gt;The main question&lt;/H3&gt;&lt;P&gt;Does anyone know how to deal with this schema-merging problem?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2023 16:06:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-consuming-different-files-from-the-same-path/m-p/54358#M30059</guid>
      <dc:creator>Arnold_Souza</dc:creator>
      <dc:date>2023-11-30T16:06:02Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables consuming different files from the same path are combining the schema</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-consuming-different-files-from-the-same-path/m-p/54365#M30063</link>
      <description>&lt;P&gt;Found a solution: never pass&amp;nbsp;&lt;STRONG&gt;'fileNamePattern', '*file_1*'&lt;/STRONG&gt;&amp;nbsp;as an option.&lt;/P&gt;&lt;P&gt;Instead, put the glob pattern directly into the path:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;"abfss://&amp;lt;container&amp;gt;@&amp;lt;storage_account&amp;gt;.dfs.core.windows.net/path/to/folder/*file_1*"&lt;/STRONG&gt;&lt;/P&gt;
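&lt;P&gt;For illustration, a minimal sketch of that fix applied to the source definition from the question (same placeholder table and path names as above; the option list is trimmed for brevity): the glob is appended to the path, and the fileNamePattern entry is dropped from the options map.&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;CREATE OR REFRESH STREAMING LIVE TABLE table_number_1
AS SELECT
  *,
  _metadata.file_path AS _JOB_SOURCE_FILE
FROM cloud_files(
  -- glob pattern moved into the path itself
  "abfss://&amp;lt;container&amp;gt;@&amp;lt;storage_account&amp;gt;.dfs.core.windows.net/path/to/folder/*file_1*",
  'csv',
  map(
    -- no 'fileNamePattern' entry here
    'cloudFiles.inferColumnTypes', 'true',
    'header', 'true'
  )
);&lt;/LI-CODE&gt;</description>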
      <pubDate>Thu, 30 Nov 2023 18:57:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-consuming-different-files-from-the-same-path/m-p/54365#M30063</guid>
      <dc:creator>Arnold_Souza</dc:creator>
      <dc:date>2023-11-30T18:57:49Z</dc:date>
    </item>
  </channel>
</rss>

