04-12-2022 10:15 AM
Hello,
We are migrating from Databricks Runtime 9.1 LTS to 10.4 LTS, but we're running into unexpected behavior. Our existing code works on every runtime up to 10.3 and stopped working on 10.4.
Problem:
We have a nested json file that we are flattening into a spark data frame using the code below:
from pyspark.sql import functions as F

adaccountsdf = (
    df.withColumn('Exp_Organizations', F.explode(F.col('organizations.organization')))  # one row per organization
    .withColumn('Exp_AdAccounts', F.explode(F.col('Exp_Organizations.ad_accounts')))  # one row per ad account
    .select(F.col('Exp_Organizations.id').alias('organizationId'),
            F.col('Exp_Organizations.name').alias('organizationName'),
            F.col('Exp_AdAccounts.id').alias('adAccountId'),
            F.col('Exp_AdAccounts.name').alias('adAccountName'),
            F.col('Exp_AdAccounts.timezone').alias('timezone'))
)
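For anyone following along, here is a rough plain-Python sketch (no Spark involved) of the result the two explode calls above are meant to produce. The sample record and its values are made up, and it assumes `organizations` is a struct holding an `organization` array, which may not match the real file exactly.

```python
# Hypothetical sample record; the values are invented and the exact
# nesting of the real JSON file is an assumption.
sample = {
    "organizations": {
        "organization": [
            {
                "id": "org-1",
                "name": "Org One",
                "ad_accounts": [
                    {"id": "acc-1", "name": "Account A", "timezone": "UTC"},
                    {"id": "acc-2", "name": "Account B", "timezone": "Europe/Amsterdam"},
                ],
            },
            {"id": "org-2", "name": "Org Two", "ad_accounts": []},
        ]
    }
}


def flatten_ad_accounts(record):
    """Return one flat row per (organization, ad account) pair."""
    rows = []
    # Outer loop plays the role of the first explode: one item per organization.
    for org in record["organizations"]["organization"]:
        # Inner loop plays the role of the second explode. Like explode,
        # an empty or missing ad_accounts array produces no rows.
        for account in org.get("ad_accounts") or []:
            rows.append(
                {
                    "organizationId": org["id"],
                    "organizationName": org["name"],
                    "adAccountId": account["id"],
                    "adAccountName": account["name"],
                    "timezone": account["timezone"],
                }
            )
    return rows


rows = flatten_ad_accounts(sample)
for row in rows:
    print(row)
```

The second organization has no ad accounts, so it contributes no rows, which is also what `F.explode` does for empty arrays.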
Now when we query the dataframe, the following selects work (results hidden for confidentiality):
display(adaccountsdf.select("*"))
OR
display(adaccountsdf)
When I display the schema of the dataframe we get the following:
root
 |-- organizationId: string (nullable = true)
 |-- organizationName: string (nullable = true)
 |-- adAccountId: string (nullable = true)
 |-- adAccountName: string (nullable = true)
 |-- timezone: string (nullable = true)
So everything looks as it should. The moment we select the last three fields (adAccountId, adAccountName and timezone), we get the following error:
However, when we select a single column it works fine:
Does anyone know why this is happening? It's a very strange error that only shows up in Databricks Runtime 10.4. All previous runtimes, including 10.3, 10.2, 10.1 and 9.1 LTS, work fine. The issue seems to be caused by using the explode function on an already exploded column in the dataframe.
UPDATE:
For some reason, when I run adaccountsdf.cache() before my select statements, the issue disappears. I would still like to know what's causing this in runtime 10.4 but not in the others.
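A minimal sketch of that workaround, wrapped in a small helper. The name `select_after_cache` is hypothetical (it is not part of PySpark); it relies only on `DataFrame.cache()` and `DataFrame.select()`, and it does not explain why 10.4 behaved differently.

```python
def select_after_cache(df, *columns):
    """Cache the DataFrame, then run the select that was failing.

    `df` is expected to be a PySpark DataFrame; cache() marks it for
    caching and returns the same DataFrame, so the call can be chained.
    """
    cached = df.cache()
    return cached.select(*columns)


# Usage in a notebook, assuming adaccountsdf is the dataframe built earlier:
# display(select_after_cache(adaccountsdf, "adAccountId", "adAccountName", "timezone"))
```

Caching keeps the data in memory or on disk once it is materialized, so calling `adaccountsdf.unpersist()` afterwards is worth considering once the workaround is no longer needed.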
04-20-2022 08:59 AM
It seems like the issue was miraculously resolved. I did not make any code changes but everything is now running as expected.
Maybe the latest runtime 10.4 fix released on April 19th also resolved this issue unintentionally.
04-21-2022 03:55 AM
@Emiel Smeenk
We were facing the same issue, and from 2022-Apr-20 onwards it suddenly resolved itself.
Question: Is there any website where I can see/track these "patches"?
Edit: Added Question.
04-26-2022 11:45 AM
@Kaniz Fatma
Your answer resolves my query. Thanks!
In addition, for fellow developers, I later noticed that these release notes are also available on the home screen of your Databricks workspace.
04-26-2022 12:54 PM
@Kaniz Fatma I did not ask the original question.
@Emiel Smeenk asked and answered his own question, stating that the issue was fixed on its own (probably due to the latest patch).
04-26-2022 01:16 PM
The issue resolved on its own, so I selected that as the best answer for this post.
Thanks,
Emiel