04-24-2023 01:56 AM
My data is a dump of JSON response from an API. The schema of the json is
col_name data_type
data array<struct<attributes:struct<name: String, age: Int relationships:struct<address:struct<data:arraay<struct<id: long, type: string>>>>>>>
included array<struct<id: long, type: string, attributes:struct<address: string, postalCode: string, country: string>>>
As you can see the column data contains an array of person details and includes a relationship to that person's address via an id. The column included contains the the actual address.
I want to transform this data into a new table where the person data includes the address. In short I want to get rid of this `included` business. I only have SQL to go with right now because I am using this in a STREAMING LIVE TABLE query.
04-26-2023 12:00 AM
04-24-2023 12:06 PM
Hi @Ashwin Bhaskar, You can use the SQL JOIN operation to join the data and include arrays based on the
id field. Here's an example SQL query that should accomplish this:
SELECT
data.attributes.name AS name,
data.attributes.age AS age,
included.attributes.address AS address,
included.attributes.postalCode AS postalCode,
included.attributes.country AS country
FROM
your_table_name
CROSS JOIN UNNEST(data) AS data
JOIN
UNNEST(data.attributes.relationships.address.data) AS rel
ON rel.id = included.id
JOIN
UNNEST(included) AS included
ON rel.id = included.id AND rel.type = included.type
In this query, we use CROSS JOIN UNNEST to flatten the data array into individual rows. Then we join the rel array from the data array with the included array based on the id field. Finally, we select the relevant fields from both arrays to create a new table comprising the personal data and address information.
Note that this query assumes that each person has only one address and that the id field in the data array matches the id field in the included array. If these assumptions do not hold for your data, you may need to modify the query accordingly.
04-24-2023 08:35 PM
@Kaniz Fatma isn't this basically doing an "explode" on "data" and "included" and then joining them? We end up doing join on the whole data set instead of within the row.
04-25-2023 03:28 AM
Hi @Ashwin Bhaskar, Apologies for the confusion in my previous response. You are correct that the previous solution performs a CROSS JOIN on the entire dataset, which may not be efficient. Instead, we can use an inline UDF to achieve the desired output without a full join.
Here's an example SQL query using an inline UDF (User-Defined Function) in Spark SQL:
CREATE TEMPORARY FUNCTION find_address AS 'org.apache.spark.sql.catalyst.expressions.ArrayFind';
WITH transformed AS (
SELECT
d.attributes.name,
d.attributes.age,
find_address(included, i -> i.id = ANY(d.relationships.address.data)) AS matched_address
FROM
your_table_name
LATERAL VIEW
explode(data) data_table AS d
)
SELECT
name,
age,
matched_address.attributes.address,
matched_address.attributes.postalCode,
matched_address.attributes.country
FROM
transformed;
In this query, we first create a temporary inline UDF called find_address
using the built-in ArrayFind function.
This function searches for an element in an array that satisfies the given condition and returns the first matched element.
Next, we create a temporary view transformed by exploding the data array and using the find_address function to match the address IDs within the same row. Finally, we select the person's details along with their address information from the
transformed view.
Please replace your_table_name with the actual name of your table.
Note that the inline UDF functionality is specific to Spark SQL and might not be available in other SQL environments.
04-26-2023 12:00 AM
I used a similar solution (exploding only one column) and it worked
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group