Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Why am I getting a cast invalid input error when using display()?

SRJDB
New Contributor

I have a Spark DataFrame. It consists of a single column of strings with 28,750 values in it. The values are all 10 digits long. I want to look at the data, like this:

my_dataframe.display()

But this returns the following error:

[CAST_INVALID_INPUT] The value 'UNKNOWN' of the type "STRING" cannot be cast to "BIGINT" because it is malformed

I also get the same error from this:

my_dataframe.count()

I get that 'UNKNOWN' can't be cast as a big integer because it's not a number. But I ran the SQL that creates the data frame, and the results do not contain 'UNKNOWN'. So I have a few questions:

  • Why does Databricks think my data frame contains the string 'UNKNOWN'?
  • Why is the display function casting my data to big integer in the first place?
  • How can I resolve this?

I'm pretty confused, so anything that helps me understand what's going on is appreciated!

If it helps, here is how the data frame is defined:

my_dataframe = spark.sql(f'''
    SELECT A.ID,
           'SOME TEXT' AS TEXT
    FROM TABLE_1 A
    INNER JOIN TABLE_2 B
        ON A.PRODUCT_ID = B.PRODUCT_ID
    LEFT JOIN (
        SELECT ID
        FROM TABLE_3
        WHERE NUMBER IN ({a_series})
        GROUP BY ID
    ) C
        ON A.ID = C.ID
    LEFT JOIN (
        SELECT ID, MAX(AGE) AS AGE, MAX(GENDER) AS GENDER
        FROM TABLE_4
        WHERE AGE IS NOT NULL
        GROUP BY ID
    ) D
        ON A.ID = D.ID
    WHERE A.DATE BETWEEN DATE_SUB(CURRENT_DATE, {a_number}) AND CURRENT_DATE
      AND B.CODE = '{a_string}'
      AND C.ID IS NULL
      AND D.AGE BETWEEN {age_limit_lower} AND {age_limit_upper}
    GROUP BY A.ID
    LIMIT {another_number}
''')

As for the data types of the columns:

  • A.ID, A.PRODUCT_ID, B.PRODUCT_ID, D.GENDER, and B.CODE are strings
  • C.ID, D.ID, and C.NUMBER are integers
  • D.AGE is a decimal(8,4)
  • A.DATE is a date
1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @SRJDB ,

Could you execute my_dataframe.printSchema() and attach the result here?
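One possible explanation, which is not confirmed anywhere in this thread: the query joins A.ID (a string) to C.ID and D.ID (integers), and the error message suggests the string side is being implicitly cast to BIGINT for that comparison. If so, any A.ID value such as 'UNKNOWN' in TABLE_1 would break the cast during the join, before the WHERE filters remove that row, and the error would only appear when an action like display() or count() actually runs the query. The plain-Python sketch below only mimics the difference between a strict cast and a tolerant one and shows how to list offending values; it does not call Spark, the function names and sample IDs are hypothetical, and it does not check the BIGINT range.

```python
from typing import Iterable, List, Optional


def strict_cast_to_bigint(value: str) -> int:
    """Mimic a strict string-to-integer cast: raise on malformed input."""
    try:
        return int(value)
    except ValueError:
        raise ValueError(
            f"The value '{value}' of the type \"STRING\" "
            "cannot be cast to \"BIGINT\" because it is malformed"
        ) from None


def try_cast_to_bigint(value: str) -> Optional[int]:
    """Mimic a tolerant cast: return None instead of raising."""
    try:
        return int(value)
    except ValueError:
        return None


def find_uncastable(values: Iterable[str]) -> List[str]:
    """Return the values that a strict cast would reject."""
    return [v for v in values if try_cast_to_bigint(v) is None]


# Hypothetical sample of ID strings, including one placeholder value.
sample_ids = ["1234567890", "UNKNOWN", "0987654321"]
bad_ids = find_uncastable(sample_ids)  # ['UNKNOWN']
```

If this reading is right, the corresponding options in the SQL would be to compare like with like, for example ON A.ID = CAST(C.ID AS STRING), or to use TRY_CAST(A.ID AS BIGINT) so that malformed IDs become NULL instead of raising. Neither has been tested against the tables in the question.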

 
