I have a Spark DataFrame. It consists of a single column, in string format, with 28,750 values in it. The values are all 10 digits long. I want to look at the data like this:
my_dataframe.display()
But this returns the following error:
[CAST_INVALID_INPUT] The value 'UNKNOWN' of the type "STRING" cannot be cast to "BIGINT" because it is malformed
I also get the same error from this:
my_dataframe.count()
I get that 'UNKNOWN' can't be cast to a big integer because it's not a number. But I ran the SQL that creates the data frame, and the results do not contain 'UNKNOWN'. So I have a few questions:
- Why does Databricks think my data frame contains the string 'UNKNOWN'?
- Why is the display function casting my data to big integer in the first place?
- How can I resolve this?
I'm pretty confused, so anything that helps me understand what's going on is appreciated!
If it helps, here is how the data frame is defined:
my_dataframe = spark.sql(f'''
    SELECT A.ID,
           'SOME TEXT' AS TEXT
    FROM TABLE_1 A
    INNER JOIN TABLE_2 B
        ON A.PRODUCT_ID = B.PRODUCT_ID
    LEFT JOIN (
        SELECT ID
        FROM TABLE_3
        WHERE NUMBER IN ({a_series})
        GROUP BY ID
    ) C
        ON A.ID = C.ID
    LEFT JOIN (
        SELECT ID, MAX(AGE) AS AGE, MAX(GENDER) AS GENDER
        FROM TABLE_4
        WHERE AGE IS NOT NULL
        GROUP BY ID
    ) D
        ON A.ID = D.ID
    WHERE A.DATE BETWEEN DATE_SUB(CURRENT_DATE, {a_number}) AND CURRENT_DATE
      AND B.CODE = '{a_string}'
      AND C.ID IS NULL
      AND D.AGE BETWEEN {age_limit_lower} AND {age_limit_upper}
    GROUP BY A.ID
    LIMIT {another_number}
''')
As for the data types of the columns:
- A.ID, A.PRODUCT_ID, B.PRODUCT_ID, D.GENDER, and B.CODE are strings
- C.ID, D.ID, and C.NUMBER are integers
- D.AGE is a decimal(8,4)
- A.DATE is a date
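
Given the types above, the joins ON A.ID = C.ID and ON A.ID = D.ID compare a string with an integer, which may involve an implicit cast of A.ID. One thing that might be worth checking is whether TABLE_1.ID holds any value that does not parse as a number. Below is a minimal sketch of such a check. It assumes TRY_CAST is available on the runtime in use, and the helper name build_non_numeric_id_query is hypothetical, introduced only for illustration:

```python
def build_non_numeric_id_query(table: str = "TABLE_1", column: str = "ID") -> str:
    """Return SQL listing distinct values of `column` that do not parse as BIGINT.

    TRY_CAST yields NULL instead of raising an error when the input is
    malformed, so non-NULL values whose TRY_CAST is NULL are the suspects.
    """
    return (
        f"SELECT DISTINCT {column} "
        f"FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"AND TRY_CAST({column} AS BIGINT) IS NULL"
    )


query = build_non_numeric_id_query()
print(query)

# In a Databricks notebook, where `spark` is already defined, the query
# could then be run as (assumption, not executed here):
# spark.sql(query).display()
```

If this returns rows such as 'UNKNOWN', those values exist in the source table even though they never show up in the final result, which would be consistent with the error coming from the join comparison rather than from the display call itself.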