Data Engineering

Trouble referencing a column that has been added by schema evolution (Auto Loader with Delta Live Tables)

ilarsen
Contributor

Hi,

I have a Delta Live Tables pipeline that uses Auto Loader to ingest from JSON files. I need to do some transformations - in this case, converting timestamps. Except one of the timestamp columns does not exist in every file, which causes the DLT pipeline to fail with an error that the column does not exist.

I should preface this with "I'm pretty new to Databricks, Auto Loader and DLT". I'd say I've got the basics, and I'm relying on the docs and online training.

My current approach is for the "bronze" layer to include these transformations, so I have the conversion code in the DLT declaration, operating against cloud_files(). The idea is that our raw bronze data set is still raw, but has timestamps in time zones that get around the data provider's limitations and are relevant for our locales and use cases right from the start.

The problem makes sense: the timestamp in question doesn't exist in every file, so it is not in the schema when Auto Loader reads those files. My (probably incorrect) assumption was that with Auto Loader, DLT and automatic schema evolution, I'd never see this kind of error again - away with you, broken SSIS metadata error message!

But what are my options? Is there something in the DLT declaration that I'm missing? Or is it more of an architectural problem - should I just load a raw/bronze table as-is and then do the conversions into another table afterwards?

Here's a snippet of what I'm currently trying.

CREATE OR REFRESH STREAMING LIVE TABLE <raw delta live table>
AS
  SELECT
    <timezone converted>.*,
    <convertedTimestamp from below referenced by several columns here for different outputs>
  FROM
  (
    SELECT
      <raw files>.*,
      from_utc_timestamp(<raw files>.<timestamp>, <timezone>) AS convertedTimestamp
    FROM cloud_files(
      "dbfs:<location>",
      "json"
    ) AS <raw files>
  ) AS <timezone converted>
  INNER JOIN <another table> ON <join condition>

Looking for ideas and advice, please 🙂

Thanks
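
PS: in case the architectural route is the better answer, here's roughly how I picture the split - bronze landed as-is, with the conversions in a downstream streaming table. Same made-up names as above, still leaning on the schema hint so the column is guaranteed to exist:

-- Bronze: land the raw files as-is (schema hint keeps event_ts in the schema)
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
AS SELECT * FROM cloud_files(
  "dbfs:/some/location",
  "json",
  map("cloudFiles.schemaHints", "event_ts TIMESTAMP")
);

-- Silver: do the timestamp conversions downstream instead of at ingest
CREATE OR REFRESH STREAMING LIVE TABLE events_converted
AS SELECT
  src.*,
  from_utc_timestamp(src.event_ts, 'Pacific/Auckland') AS convertedTimestamp
FROM STREAM(LIVE.raw_events) AS src;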

0 REPLIES
