Trouble referencing a column that has been added by schema evolution (Auto Loader with Delta Live Tables)

ilarsen
Contributor

Hi,

I have a Delta Live Tables pipeline, using Auto Loader, that ingests from JSON files. I need to do some transformations - in this case, converting timestamps. However, one of the timestamp columns does not exist in every file, and this is causing the DLT pipeline to fail with an error that the column does not exist.

I should preface this with "I'm pretty new to Databricks, Auto Loader and DLT". I'd say I've got the basics, and I'm relying on the docs and online training.

My current approach is for the "bronze" layer to include these transformations. So I have conversion code in the DLT declaration, operating against cloud_files(). The idea is that our raw bronze data set stays raw, but has timestamps in time zones that work around the data provider's limitations and are relevant for our locales and use cases right from the start.

The problem makes sense, in that the timestamp in question doesn't exist in every file, so it is not in the schema when the files are read by Auto Loader. My (probably incorrect) assumption was that with Auto Loader, DLT, and automatic schema evolution, I'd never see this kind of error again - away with you, broken SSIS metadata error message!

But what are my options? Is there code in the DLT declaration that I'm missing? Or is it more of an architectural problem - should I just load a raw/bronze table as-is and then do the conversions into another table afterwards?

Here's a snippet of what I'm currently trying:

CREATE OR REFRESH STREAMING LIVE TABLE <raw delta live table>
AS
  SELECT
    <timezone converted>.*,
    <convertedTimestamp from below referenced by several columns here for different outputs>
  FROM
  (
    SELECT
      <raw files>.*,
      <conversion code, e.g. from_utc_timestamp(<raw files>.<timestamp>, <timezone>)> AS convertedTimestamp
    FROM cloud_files(
      "dbfs:<location>",
      "json"
    ) AS <raw files>
  ) AS <timezone converted>
  INNER JOIN <another table>

Looking for ideas and advice, please 🙂

Thanks
