Palash01
Valued Contributor

Hey @MustangR 

Thanks for bringing up your concern, You're right, merging JSON data from MongoDB with a Delta table can pose challenges due to schema differences. Here are two ways which I could think of to handle this situation:

1. Leveraging Schema Evolution:

  • Automatic Schema Evolution: When schema evolution is enabled, Delta table allows adding new columns during writes. However, it treats missing required columns differently:
    • Schema on Read vs. Write: If the missing column is present in the Delta table schema used for reading but absent in the write schema, the merge will fail.
    • New Required Column: If the missing column is a newly introduced requirement in the Delta table schema, the merge will fail.

How to solve this problem: 

  • Existing Optional Columns: Missing optional columns will have NULL values during merge, no further action needed.
  • New Optional Columns: If you anticipate encountering new optional columns in the JSON data, add them to your Delta table schema with NULL default values. This allows automatic handling by schema evolution.
  • Required Columns Present but not in Write Schema: Modify your write process to include the required column from the Delta table schema.
  • Required Columns Newly Introduced: If the requirement is recent, consider adding the column to the Delta table schema with a suitable default value to allow future merges. You can then backfill existing data if needed.

2. Preprocessing JSON Data:

  • Identify Missing Attributes: Before merging, analyze your MongoDB JSON data to identify potentially missing attributes compared to the Delta table schema.
  • Add Missing Attributes: Add missing attributes to the JSON data with default values before writing them to Delta. You can use libraries like spark-sql-catalyst to perform schema-aware transformations.
  • Handle Schema Violations: If you cannot add missing attributes, you can create a separate "error" table to store rows that violate the Delta table schema.

 

Leave a like if this helps! Kudos,
Palash