cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Databricks SQL and Engineer Notebooks yields different outputs from same script

mortenhaga
Contributor

Hi all

We are having some alarming issues regarding a script that yields different output when running on SQL vs Notebook. The correct output should be 8625 rows which it is in the notebook, but the output in Databricks SQL is 156 rows. The script uses widgets in both Notebook and SQL but we have tried with hardcored values as well. 

The script is too large to paste in here, so please get in touch with me to to obtain it.

We have no idea why this is happening. Can anyone please help?

Serverelss cluster infoNotebook cluster infoWrong output in notebook attaching servereless clusterWrong output in SQL serverlessCorrect output in notebook with notebook compute/cluster

 

1 ACCEPTED SOLUTION

Accepted Solutions

mortenhaga
Contributor

UPDATE:
I think we have identefied and solved the issue. It seems like using LAST with Databricks SQL requires to excplicitly be careful about setting the "ignoreNull" argument and also be careful about the correct datatype. I guess this is because of Databricks SQL using ANSI. Using LAST in notebook, all of this is taking care of by spark under the hood. 

With this in mind, we might just use notebooks instead of Databricks SQL solely as our got-to query UI, even for simple queries, so that we are 100% sure that we get correct results. @Retired_mod 

View solution in original post

3 REPLIES 3

Hi Kaniz

Thanks for taking your time to reply. 

Widgets: widgets is not a part of the problem as we have tested hardcoded values (strings and dates) that replace the need for widgets. Still no luck.

Cluster: Regarding your tip on cluster configurations, there is very limited configurations on a serverless cluster in comparison to a notebook cluster, so Im not sure about how we can align the clusters at all. 

Dependencies/Libraries: We use only built-in libraries and dependencies in the notebook cluster and again, since the serverless cluster have limited configurations, this does not seem to be the issue.

Clearing output: We have tried that.

Where can I get in touch with a databricks expert?

mortenhaga
Contributor

UPDATE: 

It seems to be something with the use of LAST function in Databricks SQL that is behind the different outputs. 

 

final as (select 

Skolekontrakt,
SchoolYear,
sale_id,
Dato,
last(contact_id) contact_id ,
last(SisteVerdi) SisteVerdi ,
last(Status) Status ,
last(SistOppdatert) SistOppdatert ,
Webbestilling01,
Medarbeider,
SalgsenhetKat,
SalgskortOpprettet,
SalgsDato,
Avbruddsdato,
ZipCode

from HovedTABELL

group by
Skolekontrakt,
SchoolYear,
sale_id,
Webbestilling01,
Medarbeider,
SalgsenhetKat,
SalgskortOpprettet,
SalgsDato,
Avbruddsdato,
Dato,
ZipCode)

select
  H.*,
  year(H.SalgsDato) SignYear,
  weekofyear(date(H.SalgsDato)) SignWeek,
  weekofyear(H.Dato)
from
  final H
where
  H.Status in (10,11)
  and year(H.SalgsDato) = year(H.Dato) --and weekofyear(H.SalgsDato)=weekofyear(Dato)-1
  --VED STATUS I SALGSPERIODER--
  and int(weekofyear(date(H.SalgsDato))) < int(weekofyear(H.Dato)) 

 

So, removing the groupby statement yields the same results in notebook and Databricks SQL. Also, using fex count and groupby produces the same output. But using LAST produces different outputs. @Retired_mod are you aware of the difference of using LAST in notebook vs SQL? Anyone else?

mortenhaga
Contributor

UPDATE:
I think we have identefied and solved the issue. It seems like using LAST with Databricks SQL requires to excplicitly be careful about setting the "ignoreNull" argument and also be careful about the correct datatype. I guess this is because of Databricks SQL using ANSI. Using LAST in notebook, all of this is taking care of by spark under the hood. 

With this in mind, we might just use notebooks instead of Databricks SQL solely as our got-to query UI, even for simple queries, so that we are 100% sure that we get correct results. @Retired_mod 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group