Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Databricks SQL and Engineering Notebooks yield different outputs from same script

mortenhaga
Contributor

Hi all

We are having some alarming issues with a script that yields different output when run in Databricks SQL vs a notebook. The correct output is 8625 rows, which is what the notebook returns, but Databricks SQL returns 156 rows. The script uses widgets in both the notebook and SQL, but we have tried hardcoded values as well.

The script is too large to paste in here, so please get in touch with me to obtain it.

We have no idea why this is happening. Can anyone please help?

Attachments: serverless cluster info; notebook cluster info; wrong output in notebook attached to serverless cluster; wrong output in SQL serverless; correct output in notebook with notebook compute/cluster.

 


Kaniz_Fatma
Community Manager

Hi @mortenhaga

Based on the information you've provided, there could be a few reasons why you're seeing a discrepancy in the number of rows when running the same script in SQL and in a notebook in Databricks.

Here are a few possible explanations and solutions.

1. **Widget State Issue**: There is a known issue where a widget state may not adequately clear after pressing Run All, even after clearing or removing the widget in the code. This can cause a discrepancy between the widget's visual and printed states. Re-running the cells individually may bypass this issue.

To avoid this issue entirely, Databricks recommends using ipywidgets [^1^].

2. **Data Skewness**: If your data is skewed, meaning that the distribution of values in your data set is uneven, this can cause differences in output. Make sure your data is evenly distributed.

3. **Different Cluster Configurations**: If you're running your SQL queries and notebook on different clusters, the configurations of these clusters might be different, leading to different results. Make sure both clusters have the same configurations.

4. **Different Versions of Libraries/Dependencies**: Using different versions of libraries or dependencies in your SQL environment and your notebook environment could also lead to discrepancies. Ensure that both environments use the same versions of all libraries and dependencies.

5. **Clearing State and Outputs**: Clearing the notebook state and outputs can sometimes resolve discrepancies. You can do this by selecting one of the Clear options at the bottom of the Run menu [^4^].

If you've checked all of these potential issues and you're still seeing a discrepancy, having a Databricks expert look at your specific script might help diagnose the problem.

Sources:
[^1^]: [Docs: widgets](https://docs.databricks.com/notebooks/widgets.html)
[^4^]: [Docs: notebook-outputs](https://docs.databricks.com/notebooks/notebook-outputs.html)

Hi Kaniz

Thanks for taking your time to reply. 

Widgets: widgets are not part of the problem, as we have tested hardcoded values (strings and dates) that replace the need for widgets. Still no luck.

Cluster: Regarding your tip on cluster configurations, a serverless cluster exposes very limited configuration compared to a notebook cluster, so I'm not sure how we can align the clusters at all.

Dependencies/Libraries: We use only built-in libraries and dependencies on the notebook cluster, and again, since the serverless cluster has limited configuration, this does not seem to be the issue.

Clearing output: We have tried that.

Where can I get in touch with a databricks expert?

mortenhaga
Contributor

UPDATE: 

It seems that something about the use of the LAST function in Databricks SQL is behind the different outputs.

 

final as (
  select
    Skolekontrakt,
    SchoolYear,
    sale_id,
    Dato,
    last(contact_id) contact_id,
    last(SisteVerdi) SisteVerdi,
    last(Status) Status,
    last(SistOppdatert) SistOppdatert,
    Webbestilling01,
    Medarbeider,
    SalgsenhetKat,
    SalgskortOpprettet,
    SalgsDato,
    Avbruddsdato,
    ZipCode
  from HovedTABELL
  group by
    Skolekontrakt,
    SchoolYear,
    sale_id,
    Webbestilling01,
    Medarbeider,
    SalgsenhetKat,
    SalgskortOpprettet,
    SalgsDato,
    Avbruddsdato,
    Dato,
    ZipCode
)

select
  H.*,
  year(H.SalgsDato) SignYear,
  weekofyear(date(H.SalgsDato)) SignWeek,
  weekofyear(H.Dato)
from
  final H
where
  H.Status in (10, 11)
  and year(H.SalgsDato) = year(H.Dato) --and weekofyear(H.SalgsDato)=weekofyear(Dato)-1
  --BY STATUS IN SALES PERIODS--
  and int(weekofyear(date(H.SalgsDato))) < int(weekofyear(H.Dato))

 

So, removing the GROUP BY clause yields the same results in the notebook and in Databricks SQL. Also, using e.g. COUNT with GROUP BY produces the same output. But using LAST produces different outputs. @Kaniz_Fatma are you aware of a difference when using LAST in a notebook vs SQL? Anyone else?
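This is consistent with LAST being an order-dependent aggregate: without an explicit ordering, each engine (or even each query plan) may deliver a group's rows in a different physical order and so pick a different "last" row. A minimal plain-Python sketch of that behaviour, using hypothetical contact_id values rather than the real HovedTABELL data:

```python
# LAST is order-dependent: the value it returns per group depends on the
# order in which the engine happens to deliver the rows of that group.

def last_value(rows):
    """Plain model of LAST(col) without ignoreNulls: the final value seen."""
    return rows[-1] if rows else None

# The same logical group, arriving in two different physical orders:
order_in_notebook = [101, 102, 103]
order_in_warehouse = [103, 101, 102]

print(last_value(order_in_notebook))   # 103
print(last_value(order_in_warehouse))  # 102 -- a different "last" row
```

Since the two engines are free to shuffle the rows differently, neither result is "wrong" in isolation; the query is simply non-deterministic without an explicit ordering per group.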

mortenhaga
Contributor

UPDATE:
I think we have identified and solved the issue. It seems that using LAST with Databricks SQL requires being explicit about the "ignoreNulls" argument, and careful about the correct datatype. I guess this is because Databricks SQL uses ANSI mode. When using LAST in a notebook, all of this is taken care of by Spark under the hood.

With this in mind, we might just use notebooks instead of Databricks SQL as our go-to query UI, even for simple queries, so that we are 100% sure we get correct results. @Kaniz_Fatma

Hi @mortenhaga

Yes, you are correct. When using the LAST or LAST_VALUE function in Databricks SQL, you need to be explicit about setting the ignoreNulls argument and also ensure the correct datatype. This is because Databricks SQL uses ANSI SQL, where ignoreNulls defaults to false, meaning NULL values are considered unless specified otherwise.

In contrast, when using the LAST function in PySpark notebooks, the ignoreNulls argument is also available and likewise defaults to false; if all values are null, null is returned.

If ignoreNulls is set to true, the function returns the last non-null value it sees. However, it's important to note that choosing between Databricks SQL and notebooks should not be based solely on this particular function's behaviour.
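The ignoreNulls flag also explains how the row count itself can change: if LAST(Status) returns NULL for a group, the later WHERE Status IN (10, 11) filter silently drops that group. A plain-Python model of this interaction, with made-up group values (not the actual data from this thread):

```python
# Model: with ignore_nulls=False, LAST(Status) can return NULL for a group
# whose final row has a NULL Status; a Status IN (10, 11) filter then
# silently drops that group, shrinking the result set.

def last_agg(values, ignore_nulls=False):
    """Plain model of LAST(col [, ignoreNulls]): last (non-null) value seen."""
    result = None
    for v in values:
        if v is not None or not ignore_nulls:
            result = v
    return result

groups = {
    "sale_1": [10, 11, None],   # final row has NULL Status
    "sale_2": [10, 10, 11],
}

for flag in (False, True):
    kept = [g for g, vals in groups.items()
            if last_agg(vals, ignore_nulls=flag) in (10, 11)]
    print(flag, kept)
# False -> only sale_2 survives the filter; True -> both groups survive
```

In this sketch, flipping ignore_nulls changes which groups pass the filter, which is the same mechanism that can turn 8625 rows into a much smaller result.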

Both interfaces have their strengths and use cases. Databricks SQL provides a familiar SQL interface that is easy to use for simple queries, while notebooks offer more flexibility and are more suited for complex data processing tasks. 

Here are the relevant sources:
1. [Databricks SQL LAST_VALUE Function](https://docs.databricks.com/sql/language-manual/functions/last_value.html)
2. [Databricks SQL LAST Function](https://docs.databricks.com/sql/language-manual/functions/last.html)
3. [PySpark LAST Function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.last...)
