cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Explode is giving unexpected results.

BenLambert
Contributor

I have a dataframe with a schema similar to the following:

id: string

array_field: array

   element: struct

          field1: string

          field2: string

          array_field2: array

               element: struct

                     nested_field: string

I am trying to flatten this into rows. 

The issue that I am having is that when I do something like e_df = df.select("id", F.explode("array_field")) it is only returning the exploded values for the first id. I am not sure if this is something simple, but I have wasted a lot of time trying to sort out what the issue is. When I look every id has an associated array field and I would think that the result should be something like:

id, col

1, first element struct

1, second element struct

2, first element struct

2, second element struct 

and so on. Any insight here would be very helpful.

     

1 ACCEPTED SOLUTION

Accepted Solutions

BenLambert
Contributor

It turns out that if the exploded fields don't match the schema that was defined when reading the JSON in the first place that all the data that doesn't match is silently dropped. This is not really nice default behaviour.

View solution in original post

5 REPLIES 5

Kaniz
Community Manager
Community Manager

Hi @BenLambert, It seems like you’re dealing with a nested structure in your DataFrame and want to flatten it into rows.

 

Let’s address this step by step.

 

To achieve the desired result, you’ll need to use the explode function twice: once for the outer array (array_field) and then for the inner array (array_field2).

 

Here’s how you can do it:

  1. First, explode the outer array (array_field) to create multiple rows for each element in the array.
  2. Next, explode the inner array (array_field2) within each of the exploded rows from step 1.

 

Make sure to replace the sample data with your actual DataFrame, and adjust the column names accordingly. This approach should work for flattening nested structures in your DataFrame.

 

Feel free to ask if you encounter any issues or need further assistance! 😊

@Kaniz thanks for the tip. This is the approach I am trying to take, but the issue is that exploding the outer array fails. It only returns the 4 values associated with the first id and not all possible rows.

Just to make it clear, I have about 19,000 ids and when I explode I only get 4 rows that correspond to the first id.

@Kaniz additionally I am getting a message that says something to the effect of:

"This query is on a non-Delta table with many small files. To improve the performance of queries, convert to Delta and run the OPTIMIZE command on the table dbfs:/mnt/bucket/id_1.json.", which makes it seem that it is only reading the file containing the first id for some reason.

BenLambert
Contributor

It turns out that if the exploded fields don't match the schema that was defined when reading the JSON in the first place that all the data that doesn't match is silently dropped. This is not really nice default behaviour.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.