Databricks runtime 14.3 gives error scala.math.BigInt cannot be cast to java.lang.Integer

HeijerM84
New Contributor III

We have a cluster running on 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).
We want to test with a different type of cluster: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12).

All of a sudden we get errors complaining that scala.math.BigInt cannot be cast to java.lang.Integer. Running the same SQL query on the 13.3 cluster does not give any errors.

Does anyone recognize this type of issue, and has anybody come up with an idea of how to fix it?

1 ACCEPTED SOLUTION

HeijerM84
New Contributor III

I have found out when the issue arises. Below is a simplified version of the situation.

I create a temporary view called 'v_dim_One' with an arbitrary column and a row number that has a maximum value of, for example, 200.

%sql
CREATE OR REPLACE GLOBAL TEMPORARY VIEW v_dim_One AS
SELECT
    row_number() OVER (ORDER BY ColA ASC) AS DimA_NK
  , ColA
FROM TableOne

Then I create a second temporary view that contains a UNION and joins 'v_dim_One' to return the row-number column 'DimA_NK'.

%sql
CREATE OR REPLACE GLOBAL TEMPORARY VIEW v_dim_Two AS
SELECT
    ColB
  , DimA_NK
FROM TableTwoFirst
LEFT JOIN global_temp.v_dim_One ON ColA = ColB
UNION
SELECT
    ColB
  , DimA_NK
FROM TableTwoSecond
LEFT JOIN global_temp.v_dim_One ON ColA = ColB

Now when I do a SELECT * FROM global_temp.v_dim_Two, the previously described error occurs.

I did some extensive tests. The error does not occur in the following scenarios:
- v_dim_Two only selects from TableTwoFirst (only the part above the UNION)
- v_dim_Two only selects from TableTwoSecond (only the part below the UNION)
- The statement used to create the temp view 'v_dim_Two' is queried directly

I can prevent the error from occurring by casting DimA_NK as a LONG:

%sql
CREATE OR REPLACE GLOBAL TEMPORARY VIEW v_dim_Two AS
SELECT
    ColB
  , CAST(DimA_NK AS LONG) AS DimA_NK -- CAST AS LONG
FROM TableTwoFirst
LEFT JOIN global_temp.v_dim_One ON ColA = ColB
UNION
SELECT
    ColB
  , CAST(DimA_NK AS LONG) AS DimA_NK -- CAST AS LONG
FROM TableTwoSecond
LEFT JOIN global_temp.v_dim_One ON ColA = ColB

So the conclusion, however weird, is: whenever a UNION is used, cast every column that comes from a joined table and was created through a row_number() function as a LONG. If anyone can explain why this is required even when the row-number value is very small, I am open to reasoning!
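The same workaround can also be applied at the source rather than in every consuming view. A sketch, reusing the table and column names from the example above: widening the row_number() result to BIGINT where it is generated means both branches of any later UNION already agree on the column type.

```sql
-- Sketch of the workaround applied at the source (names from the example above):
-- widen the row_number() output to BIGINT once, so downstream UNIONs
-- never see an INT/BIGINT mismatch for this column.
CREATE OR REPLACE GLOBAL TEMPORARY VIEW v_dim_One AS
SELECT
    CAST(row_number() OVER (ORDER BY ColA ASC) AS BIGINT) AS DimA_NK
  , ColA
FROM TableOne;
```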

View solution in original post

7 REPLIES

Hi, thanks for your response. The error message below shows the place where the error is thrown. I don't see how this helps me, as I don't have much knowledge of Spark and Java. Does this message make any sense to you, and if so, does it lead to a more specific idea of where to find a solution?

(Screenshot attachment: error databricks.jpg)

Kaniz_Fatma
Community Manager

Hey there! Thanks a bunch for being part of our awesome community! 🎉

We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution for you. And remember, if you ever need more help, we're here for you!

Keep being awesome! 😊🚀

 


jeroenvs
New Contributor III

The same bug is affecting me, but only when using a Databricks Runtime 14.3 LTS single-user cluster.

I'm using row_number() on string columns, which should result in an integer. However, Spark internally seems to raise an error about not being able to cast a Long to an Int.

 

Example code:

WITH A AS (
  SELECT
  --opportunityid is a string
    ROW_NUMBER() OVER(ORDER BY tbl_a.opportunityid) AS PK_A
  FROM 20_silver_crmeiw.opportunities AS tbl_a
)

, B AS (
  SELECT
  --mcw_contractid is a string
    ROW_NUMBER() OVER(ORDER BY tbl_b.mcw_contractid) AS PK_B
  FROM 20_silver_crmeiw.mcw_contracts AS tbl_b
)

, C AS (
  SELECT
    PK_A
  FROM A

  UNION
  
  SELECT
    PK_B
  FROM B
)

SELECT
  *
FROM A
LEFT JOIN C
ON 1=1

 

Without the join, this code is fine:

WITH A AS (
  SELECT
  --opportunityid is a string
    ROW_NUMBER() OVER(ORDER BY tbl_a.opportunityid) AS PK_A
  FROM 20_silver_crmeiw.opportunities AS tbl_a
)

, B AS (
  SELECT
  --mcw_contractid is a string
    ROW_NUMBER() OVER(ORDER BY tbl_b.mcw_contractid) AS PK_B
  FROM 20_silver_crmeiw.mcw_contracts AS tbl_b
)

SELECT
PK_A
FROM A

UNION

SELECT
PK_B
FROM B

 

Querying SELECT typeof(PK_A) FROM A and SELECT typeof(PK_B) FROM B both return 'int'.

Kaniz_Fatma
Community Manager

Hi @HeijerM84,

  • Databricks Runtime 14.3 LTS is powered by Apache Spark 3.5.0 and was released in February 2024.
  • The release notes mention various new features and improvements, but unfortunately, there's no specific mention of the issue you're facing with row_number() and string columns.
  • Ensure that the join condition between A and C is correctly specified. In your example, you're using ON 1=1, which means all rows from A will be joined with all rows from C. Make sure this is the intended behavior.

The example is a simplified query to illustrate the situation. The join condition is not a factor in this issue. The factors that cause the issue are the row_number(), the UNION, and the layered SELECT.

jeroenvs
New Contributor III

I logged the issue with Microsoft last week, and they confirmed it is a Databricks bug. A fix is supposedly being rolled out across Databricks regions at the moment. From their response:

As anticipated, we have engaged the Databricks core team to further investigate the issue and get an update.

They have confirmed that the issue was caused by a regression in DBR 14.3; it has been fixed and should be ready in all regions soon.

In the meantime, they also shared the following two configs that can be used to mitigate the issue:

spark.databricks.optimizer.estimateUnion.enabled = false

or

spark.databricks.optimizer.propagateStatsThroughWindow.enabled = false
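For reference, a sketch of how either config can be applied from a notebook SQL cell for the current session; only the config names themselves come from the support response above, and behavior may differ per runtime.

```sql
-- Mitigation shared by support: disable either optimizer rule for this session.
SET spark.databricks.optimizer.estimateUnion.enabled = false;
-- or:
SET spark.databricks.optimizer.propagateStatsThroughWindow.enabled = false;
```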
