
Doing a join within the same row in SQL

qwerty1
Contributor

My data is a dump of a JSON response from an API. The schema of the JSON is:

col_name   data_type

data       array<struct<attributes:struct<name: string, age: int, relationships:struct<address:struct<data:array<struct<id: long, type: string>>>>>>>

included   array<struct<id: long, type: string, attributes:struct<address: string, postalCode: string, country: string>>>

As you can see, the data column contains an array of person details and references each person's address via an id in relationships. The included column contains the actual address records.

I want to transform this data into a new table where the person data includes the address. In short, I want to get rid of this `included` business. I only have SQL to work with right now because I am using this in a STREAMING LIVE TABLE query.
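For concreteness, a single row matching that schema might look like the following, built here with Spark SQL struct and array literals (all values are invented for illustration):

-- one person ("Jane") whose relationships entry points at included entry id 7
SELECT
  array(named_struct(
    'attributes', named_struct(
      'name', 'Jane',
      'age', 34,
      'relationships', named_struct(
        'address', named_struct(
          'data', array(named_struct('id', CAST(7 AS BIGINT), 'type', 'address'))
        )
      )
    )
  )) AS data,
  array(named_struct(
    'id', CAST(7 AS BIGINT),
    'type', 'address',
    'attributes', named_struct('address', '1 Main St', 'postalCode', '94105', 'country', 'US')
  )) AS included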


Kaniz
Community Manager

Hi @Ashwin Bhaskar, you can use a SQL join to combine the data and included arrays based on the id field. Here's an example SQL query that should accomplish this:

SELECT
  d.attributes.name AS name,
  d.attributes.age AS age,
  i.attributes.address AS address,
  i.attributes.postalCode AS postalCode,
  i.attributes.country AS country
FROM
  your_table_name
  -- flatten each array into one row per element
  LATERAL VIEW explode(data) data_table AS d
  LATERAL VIEW explode(d.attributes.relationships.address.data) rel_table AS rel
  LATERAL VIEW explode(included) included_table AS i
WHERE
  rel.id = i.id AND rel.type = i.type

In this query, we use LATERAL VIEW explode to flatten the data array into individual rows, explode each person's address relationships and the included array the same way, and then keep only the combinations where the id and type fields match. Finally, we select the relevant fields to create a new table combining the person data with the address information.

Note that this query assumes that each person has only one address and that the id values in the relationships match id values in the included array. If these assumptions do not hold for your data, you may need to modify the query accordingly, for example as sketched below.
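If a person can instead have several addresses, the join above simply yields one output row per match. A rough sketch of collapsing those back to one row per person, assuming the query above has been saved as a view named joined_result (hypothetical name):

SELECT
  name,
  age,
  -- gather all matched addresses for a person into one array column
  collect_list(named_struct('address', address, 'postalCode', postalCode, 'country', country)) AS addresses
FROM joined_result
GROUP BY name, age;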

@Kaniz Fatma isn't this basically doing an "explode" on "data" and "included" and then joining them? We end up doing a join on the whole dataset instead of within the row.

Kaniz
Community Manager

Hi @Ashwin Bhaskar, apologies for the confusion in my previous response. You are correct that the previous solution joins across the entire exploded dataset, which may not be efficient. Instead, we can use Spark SQL's built-in higher-order functions to match the address within the same row, without a full join.

Here's an example query using the higher-order functions filter and exists in Spark SQL:

WITH transformed AS (
  SELECT
    d.attributes.name AS name,
    d.attributes.age AS age,
    -- keep only the included entries whose id and type match a relationship
    -- entry in this same row, then take the first match (NULL if none)
    filter(
      included,
      i -> exists(
        d.attributes.relationships.address.data,
        r -> r.id = i.id AND r.type = i.type
      )
    )[0] AS matched_address
  FROM
    your_table_name
  LATERAL VIEW
    explode(data) data_table AS d
)
SELECT
  name,
  age,
  matched_address.attributes.address,
  matched_address.attributes.postalCode,
  matched_address.attributes.country
FROM
  transformed;

In this query, filter keeps only the elements of the included array whose id and type match one of the relationship entries in the same row (the per-element check is done with exists), and [0] takes the first matched element, or NULL if there is none. We compute this inside a common table expression, transformed, after exploding the data array, so the matching happens within each row rather than as a join across the dataset. Finally, we select the person's details along with their address information from transformed.
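As a quick illustration of the filter semantics, on a toy array unrelated to the schema above:

-- filter keeps the elements matching the predicate; [0] takes the first, or NULL if none match
SELECT filter(array(1, 2, 3), x -> x > 1)[0];  -- returns 2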

Please replace your_table_name with the actual name of your table.

Note that higher-order functions such as filter and exists are specific to Spark SQL (available since Spark 2.4) and might not be available in other SQL environments.
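Since this is going into a STREAMING LIVE TABLE, the same query can be embedded in a Delta Live Tables SQL definition. A rough sketch, assuming a hypothetical upstream table named raw_api_dump in the same pipeline:

-- persons_with_address and raw_api_dump are hypothetical names
CREATE OR REFRESH STREAMING LIVE TABLE persons_with_address
AS SELECT
  d.attributes.name AS name,
  d.attributes.age AS age,
  -- same per-row matching as above, keeping the matched address struct
  filter(
    included,
    i -> exists(
      d.attributes.relationships.address.data,
      r -> r.id = i.id AND r.type = i.type
    )
  )[0].attributes AS address_attributes
FROM STREAM(LIVE.raw_api_dump)
LATERAL VIEW explode(data) data_table AS d;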

I used a similar solution (exploding only one column) and it worked
