Re: How to parse VARIANT type column using Pyspark...

juanicobsider · ‎08-05-2024

I'm trying to parse a VARIANT data type column. What is the correct syntax to parse sub-columns using PySpark? Is it possible? I'd like to know how to do it this way (I already know how to do it using SQL syntax).

szymon_dybczak · ‎08-05-2024

Hi @juanicobsider ,

I think that syntax is not fully supported yet in PySpark. As a workaround, you can use expr, like below:

from pyspark.sql import Row
from pyspark.sql.functions import col, expr, parse_json

json_string = '{"title":"example", "animal": "test"}'
df = spark.createDataFrame([Row(json_col=json_string)])

# Convert the JSON string column into a VARIANT column
df = df.select(parse_json(col("json_col")).alias("json_col"))

# Extract a field using the SQL path syntax through expr()
display(df.select(expr("json_col:animal")))

Witold · ‎08-06-2024

To add to what @szymon_dybczak already correctly said: it's actually not a workaround, it's designed and documented that way. Make sure that you understand the difference between `:` and `.`.
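
For illustration, here is a minimal sketch of how the two operators combine, assuming the Databricks SQL path syntax: `:` starts the path into the VARIANT column, `.` steps into nested fields, and `::` casts the result. The helper name variant_path_expr is hypothetical (it is not part of PySpark) and it only handles simple identifier keys; all it does is build the string you would pass to expr.

```python
import re

# Only plain identifiers are handled in this sketch; keys with spaces or
# special characters would need bracket notation instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def variant_path_expr(column, keys, cast=None):
    """Build a SQL expression string such as 'json_col:owner.name::string'.

    Hypothetical helper for illustration only.
    """
    if not keys:
        raise ValueError("at least one key is required")
    for key in keys:
        if not _IDENTIFIER.match(key):
            raise ValueError(f"unsupported key for this sketch: {key!r}")
    # ':' enters the VARIANT value, '.' navigates nested fields
    expression = f"{column}:{'.'.join(keys)}"
    if cast is not None:
        # '::' casts the extracted VARIANT value to a SQL type
        expression += f"::{cast}"
    return expression


# Usage on a DataFrame with a VARIANT column (needs a Spark session):
#   from pyspark.sql.functions import expr
#   df.select(expr(variant_path_expr("json_col", ["animal"], cast="string")))
```

Without the cast the extracted value stays a VARIANT, so adding `::string` (or another type) is usually what you want before comparing or joining on it.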

Regarding PySpark, the API has other variant-related functions as well, like variant_get.
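
A rough sketch of the variant_get route, assuming a runtime where pyspark.sql.functions exposes variant_get (Spark 4.0 or a recent Databricks Runtime). It takes the VARIANT column, a JSONPath string starting with `$`, and a target type. The helpers json_path and extract_field are hypothetical names made up for this example, not part of the PySpark API.

```python
def json_path(*keys):
    """Build a JSONPath such as '$.items[0].name' from keys and list indices."""
    path = "$"
    for key in keys:
        if isinstance(key, int):
            path += f"[{key}]"   # list index segment
        else:
            path += f".{key}"    # object field segment
    return path


def extract_field(df, variant_col, keys, target_type="string"):
    """Select one field from a VARIANT column, cast to target_type.

    Hypothetical helper; pyspark is imported lazily so json_path can be
    used on its own without Spark installed.
    """
    from pyspark.sql.functions import col, variant_get

    return df.select(
        variant_get(col(variant_col), json_path(*keys), target_type)
        .alias(str(keys[-1]))
    )


# Usage with the DataFrame from the earlier reply (needs a Spark session):
#   display(extract_field(df, "json_col", ["animal"]))
```

This stays entirely in the DataFrame API, so no SQL string goes through expr; the path is still a plain string, just in JSONPath form (`$.animal`) instead of the `:` form.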
