Data Engineering
map_keys() returns an empty array in a Delta Live Tables pipeline.

d_meaker
New Contributor II

We are exploding a map type column into multiple columns based on the keys of the map. Part of this process is extracting the keys of a map type column called json_map, as illustrated in the snippet below. The code executes as expected when run in a notebook, but returns an empty array when run in a Delta Live Tables pipeline. Below is the snippet of code:

    from pyspark.sql.functions import map_keys

    keys = (
        df
        .select(map_keys("json_map"))
        .distinct()
        .collect()
    )

Does anyone know why this code runs as expected in a notebook, but returns an empty array in the Delta Live Tables pipeline? Or is there another method to extract the different fields of a map or JSON column into separate columns?

3 REPLIES

Anonymous
Not applicable

@De Vos Meaker:

One potential reason the code works in a notebook but returns an empty array in a Delta Live Tables pipeline is a difference in the data being processed: the pipeline may be reading data whose json_map column has no keys, which would produce an empty array.

As for alternative methods to extract the different fields of a map or JSON column into separate columns, you can try using the getItem function in PySpark. Here's an example code snippet:

from pyspark.sql.functions import col
 
df = df.select(
  col("json_map").getItem("key1").alias("column1"),
  col("json_map").getItem("key2").alias("column2"),
  col("json_map").getItem("key3").alias("column3")
)

This code creates new columns column1, column2, and column3 by extracting the values of the keys "key1", "key2", and "key3" from the json_map column using the getItem function. You can customize this code to extract the specific keys you need.

d_meaker
New Contributor II

Hi @Suteja Kanuri,

Thank you for your response and explanation. The code I showed above is not the exact snippet we are using; please find the exact snippet below. We dynamically extract the keys of the map and then use getItem() to turn the fields into columns, with the key names as the column names:

    from pyspark.sql.functions import col, explode, from_json, map_keys
    from pyspark.sql.types import MapType, StringType

    df = df.withColumn(column_name, from_json(column_name, MapType(StringType(), StringType())))

    # Explode the per-row key arrays so each collected Row holds a single scalar key.
    keys = [
        row[0]
        for row in (
            df
            .select(explode(map_keys(column_name)))
            .distinct()
            .collect()
        )
    ]

    df = df.select(
        [col(column_name).getItem(k).alias(k) for k in keys] + [filter_column]
    )

I have checked the data, and it's identical. Do you know if the dynamic way is supported in Delta Live Tables?

Anonymous
Not applicable

@De Vos Meaker:

Since Delta Live Tables is built on top of Delta Lake, which is designed to work with Apache Spark, the dynamic way of extracting the keys of a map and making columns from the fields using getItem() should be supported in Delta Live Tables. However, please test your code to confirm it works as expected.
