Databricks Community

juanicobsider · 08-05-2024

I'm trying to parse a VARIANT data type column. What is the correct syntax to parse sub-columns using PySpark? Is it possible? I'd like to know how to do it this way (I already know how to do it using SQL syntax).

szymon_dybczak · 08-05-2024

Hi @juanicobsider,

I think that syntax is not fully supported yet in PySpark. As a workaround, you can use expr as shown below:

from pyspark.sql import Row
from pyspark.sql.functions import col, expr, parse_json

json_string = '{"title":"example", "animal": "test"}'
df = spark.createDataFrame([Row(json_col=json_string)])

# Convert the JSON string column into a VARIANT column
df = df.select(parse_json(col("json_col")).alias("json_col"))

# Extract the "animal" field with the SQL ":" path syntax, wrapped in expr
display(df.select(expr("json_col:animal")))

Witold · 08-06-2024

To add to what @szymon_dybczak already said correctly: it's actually not a workaround, it's designed and documented that way. Make sure that you understand the difference between `:` and `.`.

Regarding PySpark, the API has other variant-related functions as well, like `variant_get`.

How to parse a VARIANT type column using PySpark syntax?