Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

FIELD_NOT_FOUND when selecting field not part of original schema

Magnus
Contributor

Hi,

I'm implementing a DLT pipeline that uses Auto Loader to ingest JSON files. The JSON files contain an array called Items that holds records, and two of the fields in those records weren't part of the original schema but were added later. Auto Loader has updated the schema, and when I query the bronze table in Databricks SQL I can select the two added fields: they are NULL in the data loaded before the fields existed in the JSON files, and they contain values in more recent data.

However, in the notebook that creates the silver table, I get an error when running the DLT pipeline with a message saying "Unable to process statement for View 'VehicleInspectionProtocol_Flattened_Temp'. [FIELD_NOT_FOUND] No such struct field `RegistrationDate` in ...". RegistrationDate is one of the fields that was not part of the records initially. Is there a better way to implement the notebook?

 

CREATE TEMPORARY STREAMING LIVE VIEW VehicleInspectionProtocol_Explode_Temp
AS
SELECT 
  explode(Items) as Item,
  AuditSourceFile,
  AuditLoadTime
FROM STREAM(LIVE.vehicleInspectionProtocol_Raw);

CREATE TEMPORARY STREAMING LIVE VIEW VehicleInspectionProtocol_Flattened_Temp
AS
SELECT
  Item.Id,
  Item.RegistrationDate,
  Item.InspectionDateTime,
  Item.VehicleInspectionType,
  AuditSourceFile,
  AuditLoadTime
FROM STREAM(LIVE.VehicleInspectionProtocol_Explode_Temp);

CREATE OR REFRESH STREAMING LIVE TABLE VehicleInspectionProtocol;

APPLY CHANGES INTO
  LIVE.VehicleInspectionProtocol
FROM
  STREAM(LIVE.VehicleInspectionProtocol_Flattened_Temp)
KEYS
  (Id)
SEQUENCE BY
  AuditSourceFile;

 

Thanks,
Magnus

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Magnus , It seems you’re encountering an issue with schema evolution in your DLT pipeline using Auto Loader. 

 

Let’s explore how you can improve your notebook implementation.

 

Schema Inference and Evolution:

Auto Loader can automatically detect the schema of loaded data, allowing you to initialize tables without explicitly declaring the data schema and evolve the table schema as new columns are introduced.

When first reading data, Auto Loader samples the first 50 GB or 1000 files (whichever limit is crossed first) to infer the schema. It stores the schema information in a directory called _schemas at the configured cloudFiles.schemaLocation.

For formats that don’t encode data types (such as JSON and CSV), Auto Loader infers all columns as strings. For formats with typed schema (like Parquet and Avro), it samples a subset of files and merges the schemas of the individual files.

 

To configure schema inference and evolution, specify a target directory for the option cloudFiles.schemaLocation. You can use the same directory you specify for the checkpointLocation. If you’re using Delta Live Tables, Databricks manages schema location and other checkpoint information automatically.
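As a concrete sketch (the table name and source path below are placeholders, and the cloudFiles.inferColumnTypes option is an assumption, not taken from the thread), a bronze table using Auto Loader in DLT SQL might look like this. Note that because this runs inside DLT, no cloudFiles.schemaLocation is set; DLT manages it automatically:

```sql
-- Hedged sketch of an Auto Loader bronze table in DLT SQL.
-- "<source-path>" is a placeholder; the audit columns are assumed to come
-- from file metadata and the load timestamp.
CREATE OR REFRESH STREAMING LIVE TABLE vehicleInspectionProtocol_Raw
AS SELECT
  *,
  _metadata.file_path AS AuditSourceFile,
  current_timestamp() AS AuditLoadTime
FROM cloud_files(
  "<source-path>",
  "json",
  map("cloudFiles.inferColumnTypes", "true")
);
```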

 

 

Error Message:

  • The error message you’re encountering (“Unable to process statement for View ‘VehicleInspectionProtocol_Flattened_Temp’. [FIELD_NOT_FOUND] No such struct field RegistrationDate in …”) suggests that the field RegistrationDate is missing in the schema during processing.
  • Since RegistrationDate was not part of the original records, it might not be inferred correctly by Auto Loader.
  • To address this, consider the following steps:

Suggestions:

 

Explicit Schema Definition:

Instead of relying solely on schema inference, explicitly define the schema for your flattened view (VehicleInspectionProtocol_Flattened_Temp). This way, you can ensure that all expected fields are included.

Define the schema using StructType and specify the data types for each field.

 

Check Data Consistency:

  • Ensure that the data in your JSON files consistently includes the RegistrationDate field.
  • Verify that the schema evolution process is correctly updating the schema when new columns are introduced.

Review and Debug:

  • Double-check your data sources, schema inference settings, and the actual data in your JSON files.
  • Debug any issues related to schema evolution and ensure that the schema updates are being applied correctly.

Remember that schema evolution can be complex, especially when dealing with semi-structured data. Explicitly defining the schema and verifying data consistency will help you avoid unexpected errors during processing. 

 

Good luck with your DLT pipeline! 🚀


2 REPLIES

I used a schema hint in the Auto Loader notebook to explicitly define the two fields that caused the error. Then it worked. Thanks @Kaniz !
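For anyone hitting the same error, the fix Magnus describes can be sketched as below. Only RegistrationDate is named in the thread, so the other field names and all data types here are assumptions; check the Auto Loader documentation for the exact cloudFiles.schemaHints syntax supported by your runtime:

```sql
-- Sketch of a schema hint that declares the late-arriving field up front.
-- The DDL string and the DATE/TIMESTAMP types are assumptions, not the
-- poster's exact code; "<source-path>" is a placeholder.
CREATE OR REFRESH STREAMING LIVE TABLE vehicleInspectionProtocol_Raw
AS SELECT *
FROM cloud_files(
  "<source-path>",
  "json",
  map("cloudFiles.schemaHints",
      "Items ARRAY<STRUCT<Id: STRING, RegistrationDate: DATE, InspectionDateTime: TIMESTAMP, VehicleInspectionType: STRING>>")
);
```

Because the field is declared in the hint, it exists in the table schema even for files that lack it, and those rows simply read as NULL instead of failing downstream with FIELD_NOT_FOUND.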
