My data is a dump of JSON response from an API. The schema of the json is
col_name data_type data array<struct<attributes:struct<name: String, age: Int relationships:struct<address:struct<data:arraay<struct<id: long, type: string>>>>>>> included array<struct<id: long, type: string, attributes:struct<address: string, postalCode: string, country: string>>>
As you can see the column data contains an array of person details and includes a relationship to that person's address via an id. The column included contains the the actual address.
I want to transform this data into a new table where the person data includes the address. In short I want to get rid of this `included` business. I only have SQL to go with right now because I am using this in a STREAMING LIVE TABLE query.
Hi @Ashwin Bhaskar, You can use the SQL JOIN operation to join the data and include arrays based on the
id field. Here's an example SQL query that should accomplish this:
SELECT data.attributes.name AS name, data.attributes.age AS age, included.attributes.address AS address, included.attributes.postalCode AS postalCode, included.attributes.country AS country FROM your_table_name CROSS JOIN UNNEST(data) AS data JOIN UNNEST(data.attributes.relationships.address.data) AS rel ON rel.id = included.id JOIN UNNEST(included) AS included ON rel.id = included.id AND rel.type = included.type
In this query, we use CROSS JOIN UNNEST to flatten the data array into individual rows. Then we join the rel array from the data array with the included array based on the id field. Finally, we select the relevant fields from both arrays to create a new table comprising the personal data and address information.
Note that this query assumes that each person has only one address and that the id field in the data array matches the id field in the included array. If these assumptions do not hold for your data, you may need to modify the query accordingly.
Hi @Ashwin Bhaskar, Apologies for the confusion in my previous response. You are correct that the previous solution performs a CROSS JOIN on the entire dataset, which may not be efficient. Instead, we can use an inline UDF to achieve the desired output without a full join.
Here's an example SQL query using an inline UDF (User-Defined Function) in Spark SQL:
CREATE TEMPORARY FUNCTION find_address AS 'org.apache.spark.sql.catalyst.expressions.ArrayFind'; WITH transformed AS ( SELECT d.attributes.name, d.attributes.age, find_address(included, i -> i.id = ANY(d.relationships.address.data)) AS matched_address FROM your_table_name LATERAL VIEW explode(data) data_table AS d ) SELECT name, age, matched_address.attributes.address, matched_address.attributes.postalCode, matched_address.attributes.country FROM transformed;
In this query, we first create a temporary inline UDF called find_address
using the built-in ArrayFind function.
This function searches for an element in an array that satisfies the given condition and returns the first matched element.
Next, we create a temporary view transformed by exploding the data array and using the find_address function to match the address IDs within the same row. Finally, we select the person's details along with their address information from the
Please replace your_table_name with the actual name of your table.
Note that the inline UDF functionality is specific to Spark SQL and might not be available in other SQL environments.
Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections.
Click here to register and join today!