Schema Evolution in Azure Databricks

CBL
New Contributor

Hi All -

In my scenario, I am loading data from hundreds of JSON files.

The problem is that fields/columns go missing when a JSON file contains new fields.

Full Load:

While writing the JSON to Delta, I use the option ("mergeSchema", "true") so that we do not miss new columns.
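For context, the full-load write is roughly the sketch below (paths and table locations are placeholders for my environment):

```python
# Full load: read the whole batch of JSON files and rewrite the Delta table.
# spark is the SparkSession provided by the Databricks runtime; paths are placeholders.
df = spark.read.json("/mnt/raw/source/*.json")

(
    df.write
    .format("delta")
    .mode("overwrite")                 # full load replaces the existing data
    .option("mergeSchema", "true")     # evolve the table schema instead of failing on new columns
    .save("/mnt/delta/target")
)
```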

Incremental Load:

The problem is here, as the incoming schema does not match the existing schema.

Could you please assist with schema comparison while doing an incremental load?

The schema of the new JSON files should be compared with the schema of the existing JSON files.


Kaniz
Community Manager

Hi @CBL,

Handling schema evolution during incremental data loads is crucial to ensure data consistency and to prevent issues when new fields are introduced.

Let’s explore some strategies for schema comparison in incremental loads:

  1. Checksum-based Incremental Load:

    • Compute a checksum or hash for each source file (or record) and load only the files whose checksum has changed since the last run, so unchanged data is skipped.
  2. Schema Evolution and Compatibility:

    • Schema evolution refers to changes in the structure of data over time.
    • When performing incremental loads, consider schema changes such as adding new fields or modifying existing ones.
    • For full loads, you’re already using the "mergeSchema" option to handle new columns.
    • For incremental loads, you need to address schema mismatches.
    • Here are some approaches:
      • Backward Compatibility: Ensure that new fields added to the schema are backward compatible with existing data. Existing columns should remain unchanged.
      • Forward Compatibility: Ensure that existing data can handle new fields without breaking.
      • Versioning: Maintain version information for your schema to track changes.
      • Schema Registry: Use a schema registry to manage schema versions and compatibility.
  3. Delta Lake and Schema Evolution:

    • Since you’re using Delta Lake, consider the following:
      • With schema evolution enabled, Delta Lake detects schema changes during writes and updates the table schema accordingly.
      • You can enable this with the "mergeSchema" write option for both full and incremental loads (or set spark.databricks.delta.schema.autoMerge.enabled for the session).
      • For incremental loads with "mergeSchema" enabled, new fields in the incoming data are added to the table schema; a minimal append sketch is shown further below.
  4. Data Validation and Testing:

    • Implement thorough data validation during incremental loads.
    • Write tests to validate the schema compatibility.
    • Test scenarios where new fields are introduced or existing fields change.
    • Use tools like PySpark or custom scripts to validate incoming data against the expected schema; a schema-comparison sketch is shown right after this list.
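As a starting point, here is a minimal sketch (not a drop-in solution) of comparing the schema of newly arrived JSON files with the schema of the existing Delta table before an incremental append. The paths are placeholders for your environment:

```python
# Compare the incoming JSON schema with the existing Delta table schema.
# spark is the SparkSession provided by the Databricks runtime; paths are placeholders.
incoming_df = spark.read.json("/mnt/raw/incremental/*.json")
existing_df = spark.read.format("delta").load("/mnt/delta/target")

incoming = {f.name: f.dataType for f in incoming_df.schema.fields}
existing = {f.name: f.dataType for f in existing_df.schema.fields}

new_cols = sorted(set(incoming) - set(existing))        # fields added in the new files
missing_cols = sorted(set(existing) - set(incoming))    # fields the new files no longer carry
changed_types = sorted(
    name
    for name in set(incoming) & set(existing)
    if incoming[name] != existing[name]                 # same name, different data type
)

print("New columns:    ", new_cols)
print("Missing columns:", missing_cols)
print("Type changes:   ", changed_types)
```

New columns are generally safe to append with schema evolution enabled, and columns missing from the new batch simply come through as null for the new rows; type changes are the case that usually needs manual intervention.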

Remember that schema evolution is an ongoing process, especially in data engineering. Regularly review and adapt your approach based on changing requirements and data sources.
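If the comparison looks acceptable, the incremental append itself can stay simple. A minimal sketch, again with placeholder paths:

```python
# Incremental append once the schema check has passed.
# mergeSchema lets Delta add the new columns; existing rows will show null for them.
incoming_df = spark.read.json("/mnt/raw/incremental/*.json")  # same batch as in the comparison sketch

(
    incoming_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/target")
)
```

Note that mergeSchema adds new columns but does not resolve incompatible type changes; those still need to be handled explicitly, for example by casting the affected columns in the incoming DataFrame before the write.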

If you have specific use cases or need further assistance, feel free to ask!

 