Data Engineering

Handling Unknown Fields in DLT Pipeline

mikeagicman
New Contributor

Hi
I'm working on a DLT pipeline where I read JSON files stored in S3.
I'm using Auto Loader to infer the file schema and adding schema hints for some fields to specify their types.
When running it against a single data file that contains additional fields beyond the schema hint,
I encounter the following error: 'terminated with exception: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing.'
After that, I get a list of the additional fields that were identified and do not appear in the schema hint, along with a recommendation: 'which can be fixed by an automatic retry: false.'
What does 'automatic retry: false' mean? I've tried various start and restart methods, but it still doesn't work.

This happens even though I've set the `inferColumnTypes` option to true and explicitly set `schemaEvolutionMode` to `addNewColumns`, which is already the default.
I've tried the same thing in another pipeline with a slightly less complex file, and it worked great, identifying all the fields that weren't in the schema hint.
But here, with a bit more complexity, it's causing me trouble.

I'd appreciate any help you can provide - thank you very much!

1 REPLY

Kaniz
Community Manager

Hi @mikeagicman,

The error 'terminated with exception: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing.' means that the data file contains fields that are not defined in your schema hint. These additional fields are causing the parsing process to fail.

The note you received, 'which can be fixed by an automatic retry: false.', indicates that the system will not automatically retry processing the file after encountering this error. In other words, it won’t make another attempt to parse the data with the same schema hint. Instead, it expects you to address the issue manually.

You’ve already set inferColumnTypes to true and schemaEvolutionMode to addNewColumns. However, in this specific case, it seems that the complexity of the data file is causing trouble.
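For reference, the setup described in the question might look roughly like the sketch below. The option keys are standard Auto Loader (`cloudFiles`) options; the helper names `autoloader_options` and `read_raw_events`, the hint string, and the source path are hypothetical placeholders, since the original pipeline code was not posted.

```python
# Evolution modes accepted by Auto Loader's cloudFiles.schemaEvolutionMode option.
VALID_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def autoloader_options(schema_hints, evolution_mode="addNewColumns"):
    """Build the Auto Loader options for JSON input (hypothetical helper)."""
    if evolution_mode not in VALID_EVOLUTION_MODES:
        raise ValueError(f"unsupported schemaEvolutionMode: {evolution_mode}")
    return {
        "cloudFiles.format": "json",
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        # Hints use SQL DDL syntax, e.g. "id BIGINT, event_time TIMESTAMP".
        "cloudFiles.schemaHints": schema_hints,
    }


def read_raw_events(spark, source_path, schema_hints):
    """Return a streaming DataFrame read with Auto Loader (hypothetical helper)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_hints))
        .load(source_path)
    )


# Inside a DLT pipeline this would be wrapped in a table definition, e.g.:
#
#   import dlt
#
#   @dlt.table
#   def raw_events():
#       # The path below is a placeholder, not the asker's real location.
#       return read_raw_events(spark, "s3://<bucket>/<prefix>/", "id BIGINT")
```

Calling `autoloader_options(hints, evolution_mode="rescue")` instead keeps the schema fixed; according to the Auto Loader documentation, fields outside the schema are then captured in the `_rescued_data` column rather than failing the stream, which may be worth trying if the failure persists.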

Let’s explore some potential solutions:

  • Review the Schema Hint: Double-check your schema hint. Ensure that it accurately reflects the fields present in the data file. Sometimes, a missing or incorrect field name in the hint can lead to this error.

  • Inspect the Additional Fields: Look at the list of additional fields that were identified. Are they truly new fields, or are they variations of existing fields? Sometimes, small differences (e.g., case sensitivity, underscores, or spaces) can cause issues.

  • Explicitly Define New Fields: If the schema hint doesn’t cover all the fields in your data, consider explicitly defining the new fields. You can add them to the schema hint or handle them separately during processing.

  • Custom Handling for Unknown Fields: Implement custom logic to handle unknown fields. For example, you could log them, ignore them, or dynamically adjust the schema based on the encountered fields.

  • Retry with a Simplified File: Since your other pipeline worked well with a less complex file, try simplifying the problematic file. Remove some fields or reduce its complexity to see if it resolves the issue.

  • Check the logs for more detailed error messages.
  • Verify that the data file is correctly formatted as JSON.
  • Inspect the actual data in the file to identify any unexpected fields.
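Several of the checks above (valid JSON, unexpected fields, near-duplicate names that differ only by case) can be done offline on a copy of the file. The following is a minimal sketch using only the Python standard library. It assumes the file is line-delimited JSON (one object per line); the function names, the sample records, and the `expected` field set are made up for illustration.

```python
import json


def field_paths(value, prefix=""):
    """Yield dotted paths for every field, descending into nested objects and arrays."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            yield from field_paths(item, prefix)


def audit_json_lines(lines, expected_fields):
    """Return (unknown field paths, case-only near matches, unparseable lines)."""
    expected_lower = {name.lower(): name for name in expected_fields}
    unknown, near_matches, bad_lines = set(), {}, []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            bad_lines.append((line_no, str(err)))
            continue
        for path in field_paths(record):
            if path in expected_fields:
                continue
            twin = expected_lower.get(path.lower())
            if twin is not None:
                near_matches[path] = twin  # differs from an expected field only by case
            else:
                unknown.add(path)
    return sorted(unknown), near_matches, bad_lines


# Made-up sample data: one case mismatch, one new field, one truncated line.
sample = [
    '{"id": 1, "user": {"name": "a", "Email": "a@example.com"}}',
    '{"id": 2, "extra": true}',
    '{"id": 3, ',
]
expected = {"id", "user", "user.name", "user.email"}

unknown, near_matches, bad_lines = audit_json_lines(sample, expected)
print("unknown fields:", unknown)
print("case-only mismatches:", near_matches)
print("unparseable lines:", [line_no for line_no, _ in bad_lines])
```

To run it against a real file, pass an open file handle as `lines` and the field names you expect (from the schema hint plus the fields inferred so far) as `expected_fields`.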

Good luck, and I hope this helps you resolve the issue!