Hey @MustangR
Thanks for raising this. You're right, merging JSON data from MongoDB into a Delta table can be tricky when the schemas differ. Here are two approaches I can think of to handle this situation:
1. Leveraging Schema Evolution:
- Automatic Schema Evolution: When schema evolution is enabled, Delta allows new columns to be added during writes (see the example after this list). However, missing required columns are treated differently:
- Schema on Read vs. Write: If a column exists in the Delta table schema used for reading but is missing from the schema you write with, the merge will fail.
- New Required Column: If the missing column is a requirement newly introduced in the Delta table schema, the merge will also fail.
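For reference, here is a minimal sketch of enabling automatic schema evolution for MERGE and running the merge. The table name "target_table", the join key "id", and the source_df DataFrame (built from your MongoDB JSON) are illustrative assumptions, and spark is an existing SparkSession:

```python
from delta.tables import DeltaTable

# Enable automatic schema evolution for MERGE operations (Delta Lake setting).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Placeholders: "target_table" is your Delta table, source_df is the DataFrame
# built from the MongoDB JSON data, and "id" is the merge key.
target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
       .merge(source_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```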
How to solve this problem:
- Existing Optional Columns: Missing optional columns will simply be NULL during the merge; no further action is needed.
- New Optional Columns: If you anticipate encountering new optional columns in the JSON data, add them to your Delta table schema with NULL default values. This allows automatic handling by schema evolution.
- Required Columns Present but not in Write Schema: Modify your write process to include the required column from the Delta table schema.
- Required Columns Newly Introduced: If the requirement is recent, consider adding the column to the Delta table schema with a suitable default value so that future merges succeed. You can then backfill existing data if needed (see the sketch after this list).
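A small sketch of those last two points, adding a column to the Delta table and backfilling it. "my_delta_table", "new_col", and the 'unknown' default are placeholder names, not anything specific to your setup:

```python
# Add the newly required column to the Delta table; existing rows get NULL.
spark.sql("ALTER TABLE my_delta_table ADD COLUMNS (new_col STRING)")

# Optionally backfill existing rows with a default value.
spark.sql("UPDATE my_delta_table SET new_col = 'unknown' WHERE new_col IS NULL")
```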
2. Preprocessing JSON Data:
- Identify Missing Attributes: Before merging, analyze your MongoDB JSON data to identify potentially missing attributes compared to the Delta table schema.
- Add Missing Attributes: Add the missing attributes to the JSON data with default values before writing them to Delta. You can use Spark's DataFrame API (built on the spark-sql-catalyst module) to perform schema-aware transformations; a sketch follows this list.
- Handle Schema Violations: If you cannot add missing attributes, you can create a separate "error" table to store rows that violate the Delta table schema.
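Here is a minimal sketch of that preprocessing step, assuming the MongoDB data is already available as JSON files. The paths are placeholders, and typed NULLs are used as the default values; swap in whatever defaults make sense for your data:

```python
from pyspark.sql import functions as F

# Placeholders: the Delta table location and the MongoDB JSON export location.
delta_schema = spark.read.format("delta").load("/path/to/delta_table").schema
json_df = spark.read.json("/path/to/mongodb_export.json")

# Add any column that exists in the Delta schema but is missing from the JSON
# data, as a NULL cast to the expected type.
for field in delta_schema:
    if field.name not in json_df.columns:
        json_df = json_df.withColumn(field.name, F.lit(None).cast(field.dataType))

# Align the column order with the Delta table before merging or writing.
json_df = json_df.select([field.name for field in delta_schema])
```

Rows that still cannot be conformed to the target schema can be filtered out at this point and written to a separate "error" table for later inspection.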
Leave a like if this helps! Kudos,
Palash