Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

variant_explode_outer stopped working after the latest DBX runtime patch

norbitek
New Contributor II

Hi All,

I import the following JSON into a VARIANT column of a Delta table:

{
    "data": [
        {
            "group": 1,
            "manager": "no",
            "firstname": "John",
            "lastname": "Smith",
            "active": "false",
            "team_lead": "yes"
        }
    ],
    "page": 1,
    "size": 5000,
    "timestamp": 1775455219,
    "total": 1342,
    "totalPages": 1
}
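For reference, here is a plain-Python sketch (no Spark involved) of the rows the lateral join should produce from the data node, under the assumption that variant_explode_outer follows the documented pos/key/value shape for arrays, where key is NULL for array input:

```python
import json

# Sample payload from the post (trimmed to the fields that matter here).
payload = json.loads("""
{
    "data": [
        {"group": 1, "manager": "no", "firstname": "John",
         "lastname": "Smith", "active": "false", "team_lead": "yes"}
    ],
    "page": 1, "size": 5000, "total": 1342, "totalPages": 1
}
""")

# variant_explode_outer over an array yields one row per element:
# pos = element index, key = NULL (only set for objects), value = the element.
rows = [(pos, None, value) for pos, value in enumerate(payload["data"])]

for pos, key, value in rows:
    print(pos, key, value["firstname"], value["lastname"])
# → 0 None John Smith
```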

Because of the size, I import only the data node as VARIANT.

To do that I use the following logic:

df_transformed = (
    spark.read.format("json")
    .option("modifiedAfter", modified_after)
    .option("singleVariantColumn", "DATA")
    .load(f"{FILE_PATH}")
)

df_transformed.createOrReplaceTempView("df_transformed")

or streaming version:

df_transformed = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("singleVariantColumn", "DATA")
    .load(f"{FILE_PATH}")
)

df_transformed.createOrReplaceTempView("df_transformed")

and then query:

SELECT * EXCEPT(DATA)
FROM df_transformed,
LATERAL variant_explode_outer(DATA:data) AS DATA_exploded

This worked as expected before the 1st of April on runtime 17.3.8.

Then DBX upgraded the runtime to version 17.3.9, and the streaming version raises the following error:

org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'df_transformed.DATA' cannot be variant type. SQLSTATE: 0A000; line 1 pos 22

The spark.read version still works as expected.

I tested this with the latest 18.0 and 18.1 versions and get the same error.

The only version that still works is the unsupported 17.2, probably because it didn't receive any update.

How can I fix this problem?

Am I doing something wrong in this query?

 

1 REPLY 1

emma_s
Databricks Employee

Hi, 

I've been testing this on a workspace at my end and see exactly the same thing. I'd first recommend raising a support ticket for this. 

I reproduced it on DBR 18.0 using readStream + cloudFiles + singleVariantColumn, with the exact error you're seeing:

[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE]
Correlated column reference 'DATA' cannot be variant type.

It only affects streaming; the same query works fine in batch with spark.read. This looks like a regression from 17.3.8, which is why I'd recommend the support ticket. In the meantime you can use the following workaround:

Workaround (tested on DBR 18.0 in streaming):

Convert the variant array to a typed array with from_json, then use standard explode_outer:


SELECT record.*
FROM (
  SELECT explode_outer(
    from_json(
      DATA:data::STRING,
      'array<struct<group:int, manager:string, firstname:string, lastname:string, active:string, team_lead:string>>'
    )
  ) AS record
  FROM df_transformed
) t

This casts the variant to a string, parses it into a typed array, and uses explode_outer instead of variant_explode_outer, avoiding the correlated variant reference that triggers the error.


The trade-off is you need to define the struct schema in the from_json call, which you don't with variant_explode_outer. But it works in both batch and streaming.
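Since that struct schema is written by hand, it can silently drift from the actual JSON keys if the source API changes. A small plain-Python check (a hypothetical helper, not part of Spark) that compares the field names in the DDL schema string against a sample record:

```python
import json
import re

# The hand-written schema from the workaround above.
schema = ("array<struct<group:int, manager:string, firstname:string, "
          "lastname:string, active:string, team_lead:string>>")

# One record from the source JSON.
sample = {"group": 1, "manager": "no", "firstname": "John",
          "lastname": "Smith", "active": "false", "team_lead": "yes"}

# Pull the field names out of the struct<...> definition.
inner = re.search(r"struct<(.*)>>", schema).group(1)
schema_fields = {part.split(":")[0].strip() for part in inner.split(",")}

missing = set(sample) - schema_fields  # keys from_json would silently drop
extra = schema_fields - set(sample)    # schema fields that would come back NULL
print(missing, extra)
# → set() set()  (schema and sample agree)
```

Running a check like this against a fresh sample file before deploying a schema change is a cheap way to catch the "silently NULL column" failure mode that from_json has and variant_explode_outer doesn't.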