Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Databricks SQL and Engineering Notebooks yield different outputs from same script

mortenhaga
Contributor

Hi all

We are having some alarming issues with a script that yields different output when run in Databricks SQL vs a notebook. The correct output is 8625 rows, which is what the notebook returns, but Databricks SQL returns 156 rows. The script uses widgets in both the notebook and SQL, but we have tried hardcoded values as well.

The script is too large to paste in here, so please get in touch with me to obtain it.

We have no idea why this is happening. Can anyone please help?

Attachments: serverless cluster info; notebook cluster info; wrong output in notebook attached to serverless cluster; wrong output in SQL serverless; correct output in notebook with notebook compute/cluster.

 


Kaniz_Fatma
Community Manager

Hi @mortenhaga

Based on the information you've provided, there could be a few reasons why you're seeing a discrepancy in the number of rows when running the same script in SQL and in a notebook in Databricks.

Here are a few possible explanations and solutions.

1. **Widget State Issue**: There is a known issue where a widget state may not adequately clear after pressing Run All, even after clearing or removing the widget in the code. This can cause a discrepancy between the widget's visual and printed states. Re-running the cells individually may bypass this issue.

To avoid this issue entirely, Databricks recommends using ipywidgets [^1^].

2. **Data Skewness**: If your data is skewed, meaning that the distribution of values in your data set is uneven, this can cause differences in output. Make sure your data is evenly distributed.

3. **Different Cluster Configurations**: If you're running your SQL queries and notebook on different clusters, the configurations of these clusters might be different, leading to different results. Make sure both clusters have the same configurations.

4. **Different Versions of Libraries/Dependencies**: Using different versions of libraries or dependencies in your SQL environment and your notebook environment could also lead to discrepancies. Ensure that both environments use the same versions of all libraries and dependencies.

5. **Clearing State and Outputs**: Clearing the notebook state and outputs can sometimes resolve discrepancies. You can do this by selecting one of the Clear options at the bottom of the Run menu [^4^].

If you've checked all of these potential issues and you're still seeing a discrepancy, having a Databricks expert look at your specific script might help diagnose the problem.

Sources:
[^1^]: [Docs: widgets](https://docs.databricks.com/notebooks/widgets.html)
[^4^]: [Docs: notebook-outputs](https://docs.databricks.com/notebooks/notebook-outputs.html)

Hi Kaniz

Thanks for taking your time to reply. 

Widgets: widgets are not part of the problem, as we have tested hardcoded values (strings and dates) that replace the need for widgets. Still no luck.

Cluster: Regarding your tip on cluster configurations, a serverless cluster exposes very limited configuration compared to a notebook cluster, so I'm not sure how we can align the clusters at all.

Dependencies/Libraries: We use only built-in libraries and dependencies on the notebook cluster, and again, since the serverless cluster has limited configuration, this does not seem to be the issue.

Clearing output: We have tried that.

Where can I get in touch with a databricks expert?

mortenhaga
Contributor

UPDATE: 

It seems that something about the use of the LAST function in Databricks SQL is behind the different outputs.

 

final as (
  select
    Skolekontrakt,
    SchoolYear,
    sale_id,
    Dato,
    last(contact_id) contact_id,
    last(SisteVerdi) SisteVerdi,
    last(Status) Status,
    last(SistOppdatert) SistOppdatert,
    Webbestilling01,
    Medarbeider,
    SalgsenhetKat,
    SalgskortOpprettet,
    SalgsDato,
    Avbruddsdato,
    ZipCode
  from HovedTABELL
  group by
    Skolekontrakt,
    SchoolYear,
    sale_id,
    Webbestilling01,
    Medarbeider,
    SalgsenhetKat,
    SalgskortOpprettet,
    SalgsDato,
    Avbruddsdato,
    Dato,
    ZipCode
)

select
  H.*,
  year(H.SalgsDato) SignYear,
  weekofyear(date(H.SalgsDato)) SignWeek,
  weekofyear(H.Dato)
from
  final H
where
  H.Status in (10, 11)
  and year(H.SalgsDato) = year(H.Dato) --and weekofyear(H.SalgsDato)=weekofyear(Dato)-1
  --BY STATUS IN SALES PERIODS--
  and int(weekofyear(date(H.SalgsDato))) < int(weekofyear(H.Dato))

 

So, removing the GROUP BY clause yields the same results in the notebook and in Databricks SQL. Also, using e.g. COUNT with GROUP BY produces the same output. But using LAST produces different outputs. @Kaniz_Fatma are you aware of a difference when using LAST in a notebook vs SQL? Anyone else?
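This is consistent with LAST being an order-dependent aggregate: without an explicit ordering, each engine (or even each query plan) may deliver a group's rows in a different physical order and so pick a different "last" row. A minimal plain-Python sketch of that behaviour, using hypothetical contact_id values rather than the real HovedTABELL data:

```python
# LAST is order-dependent: the value it returns per group depends on the
# order in which the engine happens to deliver the rows of that group.

def last_value(rows):
    """Plain model of LAST(col) without ignoreNulls: the final value seen."""
    return rows[-1] if rows else None

# The same logical group, arriving in two different physical orders:
order_in_notebook = [101, 102, 103]
order_in_warehouse = [103, 101, 102]

print(last_value(order_in_notebook))   # 103
print(last_value(order_in_warehouse))  # 102 -- a different "last" row
```

Since the two engines are free to shuffle the rows differently, neither result is "wrong" in isolation; the query is simply non-deterministic without an explicit ordering per group.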

mortenhaga
Contributor

UPDATE:
I think we have identified and solved the issue. It seems that using LAST with Databricks SQL requires being explicit about the "ignoreNulls" argument, and careful about the correct datatype. I guess this is because Databricks SQL uses ANSI mode. When using LAST in a notebook, all of this is taken care of by Spark under the hood.

With this in mind, we might just use notebooks instead of Databricks SQL as our go-to query UI, even for simple queries, so that we are 100% sure we get correct results. @Kaniz_Fatma

Hi @mortenhaga

Yes, you are correct. When using the LAST or LAST_VALUE function in Databricks SQL, you need to be explicit about setting the ignoreNulls argument and also ensure the correct datatype. This is because Databricks SQL uses ANSI SQL, where ignoreNulls defaults to false, meaning NULL values are considered unless specified otherwise.

In contrast, when using the LAST function in PySpark notebooks, the ignoreNulls argument is also available and likewise defaults to false; if all values are null, null is returned.

If ignoreNulls is set to true, the function returns the last non-null value it sees. However, it's important to note that choosing between Databricks SQL and notebooks should not be based solely on this particular function's behaviour.
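The ignoreNulls flag also explains how the row count itself can change: if LAST(Status) returns NULL for a group, the later WHERE Status IN (10, 11) filter silently drops that group. A plain-Python model of this interaction, with made-up group values (not the actual data from this thread):

```python
# Model: with ignore_nulls=False, LAST(Status) can return NULL for a group
# whose final row has a NULL Status; a Status IN (10, 11) filter then
# silently drops that group, shrinking the result set.

def last_agg(values, ignore_nulls=False):
    """Plain model of LAST(col [, ignoreNulls]): last (non-null) value seen."""
    result = None
    for v in values:
        if v is not None or not ignore_nulls:
            result = v
    return result

groups = {
    "sale_1": [10, 11, None],   # final row has NULL Status
    "sale_2": [10, 10, 11],
}

for flag in (False, True):
    kept = [g for g, vals in groups.items()
            if last_agg(vals, ignore_nulls=flag) in (10, 11)]
    print(flag, kept)
# False -> only sale_2 survives the filter; True -> both groups survive
```

In this sketch, flipping ignore_nulls changes which groups pass the filter, which is the same mechanism that can turn 8625 rows into a much smaller result.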

Both interfaces have their strengths and use cases. Databricks SQL provides a familiar SQL interface that is easy to use for simple queries, while notebooks offer more flexibility and are more suited for complex data processing tasks. 

Here are the relevant sources:
1. [Databricks SQL LAST_VALUE Function](https://docs.databricks.com/sql/language-manual/functions/last_value.html)
2. [Databricks SQL LAST Function](https://docs.databricks.com/sql/language-manual/functions/last.html)
3. [PySpark LAST Function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.last...)
