<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why am I getting a cast invalid input error when using display()? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-am-i-getting-a-cast-invalid-input-error-when-using-display/m-p/141128#M51627</link>
    <description>&lt;P&gt;I have a Spark DataFrame. It consists of a single string column with 28,750 values, each 10 digits long. I want to look at the data, like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe.display()&lt;/LI-CODE&gt;&lt;P&gt;But this returns the following error:&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[CAST_INVALID_INPUT] The value 'UNKNOWN' of the type "STRING" cannot be cast to "BIGINT" because it is malformed&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I also get the same error from this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe.count()&lt;/LI-CODE&gt;&lt;P&gt;I understand that 'UNKNOWN' can't be cast to a big integer because it's not a number. But I ran the SQL that creates the DataFrame, and the results do not contain 'UNKNOWN'. So I have a few questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Why does Databricks think my DataFrame contains the string 'UNKNOWN'?&lt;/LI&gt;&lt;LI&gt;Why is the display function casting my data to a big integer in the first place?&lt;/LI&gt;&lt;LI&gt;How can I resolve this?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I'm pretty confused, so anything that helps me understand what's going on is appreciated!&lt;/P&gt;&lt;P&gt;If it helps, here is how the DataFrame is defined:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe = spark.sql(f'''
SELECT A.ID,
'SOME TEXT' AS TEXT
FROM TABLE_1 A
INNER JOIN
TABLE_2 B
ON A.PRODUCT_ID = B.PRODUCT_ID
LEFT JOIN
(
SELECT ID
FROM TABLE_3
WHERE NUMBER IN ({a_series})
GROUP BY ID
) C
ON A.ID = C.ID
LEFT JOIN
(
SELECT ID, MAX(AGE) AS AGE, MAX(GENDER) AS GENDER
FROM TABLE_4
WHERE AGE IS NOT NULL
GROUP BY ID
) D
ON A.ID = D.ID
WHERE A.DATE BETWEEN DATE_SUB(CURRENT_DATE, {a_number}) AND CURRENT_DATE
AND B.CODE = '{a_string}'
AND C.ID IS NULL
AND D.AGE BETWEEN {age_limit_lower} AND {age_limit_upper}
GROUP BY A.ID
LIMIT {another_number}
''')&lt;/LI-CODE&gt;&lt;P&gt;As for the data types of the columns:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A.ID, A.PRODUCT_ID, B.PRODUCT_ID, D.GENDER, and B.CODE are strings&lt;/LI&gt;&lt;LI&gt;C.ID, D.ID, and C.NUMBER are integers&lt;/LI&gt;&lt;LI&gt;D.AGE is a decimal(8,4)&lt;/LI&gt;&lt;LI&gt;A.DATE is a date&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Thu, 04 Dec 2025 10:02:03 GMT</pubDate>
    <dc:creator>SRJDB</dc:creator>
    <dc:date>2025-12-04T10:02:03Z</dc:date>
    <item>
      <title>Why am I getting a cast invalid input error when using display()?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-am-i-getting-a-cast-invalid-input-error-when-using-display/m-p/141128#M51627</link>
      <description>&lt;P&gt;I have a Spark DataFrame. It consists of a single string column with 28,750 values, each 10 digits long. I want to look at the data, like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe.display()&lt;/LI-CODE&gt;&lt;P&gt;But this returns the following error:&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[CAST_INVALID_INPUT] The value 'UNKNOWN' of the type "STRING" cannot be cast to "BIGINT" because it is malformed&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I also get the same error from this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe.count()&lt;/LI-CODE&gt;&lt;P&gt;I understand that 'UNKNOWN' can't be cast to a big integer because it's not a number. But I ran the SQL that creates the DataFrame, and the results do not contain 'UNKNOWN'. So I have a few questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Why does Databricks think my DataFrame contains the string 'UNKNOWN'?&lt;/LI&gt;&lt;LI&gt;Why is the display function casting my data to a big integer in the first place?&lt;/LI&gt;&lt;LI&gt;How can I resolve this?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I'm pretty confused, so anything that helps me understand what's going on is appreciated!&lt;/P&gt;&lt;P&gt;If it helps, here is how the DataFrame is defined:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;my_dataframe = spark.sql(f'''
SELECT A.ID,
'SOME TEXT' AS TEXT
FROM TABLE_1 A
INNER JOIN
TABLE_2 B
ON A.PRODUCT_ID = B.PRODUCT_ID
LEFT JOIN
(
SELECT ID
FROM TABLE_3
WHERE NUMBER IN ({a_series})
GROUP BY ID
) C
ON A.ID = C.ID
LEFT JOIN
(
SELECT ID, MAX(AGE) AS AGE, MAX(GENDER) AS GENDER
FROM TABLE_4
WHERE AGE IS NOT NULL
GROUP BY ID
) D
ON A.ID = D.ID
WHERE A.DATE BETWEEN DATE_SUB(CURRENT_DATE, {a_number}) AND CURRENT_DATE
AND B.CODE = '{a_string}'
AND C.ID IS NULL
AND D.AGE BETWEEN {age_limit_lower} AND {age_limit_upper}
GROUP BY A.ID
LIMIT {another_number}
''')&lt;/LI-CODE&gt;&lt;P&gt;As for the data types of the columns:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A.ID, A.PRODUCT_ID, B.PRODUCT_ID, D.GENDER, and B.CODE are strings&lt;/LI&gt;&lt;LI&gt;C.ID, D.ID, and C.NUMBER are integers&lt;/LI&gt;&lt;LI&gt;D.AGE is a decimal(8,4)&lt;/LI&gt;&lt;LI&gt;A.DATE is a date&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Dec 2025 10:02:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-am-i-getting-a-cast-invalid-input-error-when-using-display/m-p/141128#M51627</guid>
      <dc:creator>SRJDB</dc:creator>
      <dc:date>2025-12-04T10:02:03Z</dc:date>
    </item>
    <item>
      <title>Re: Why am I getting a cast invalid input error when using display()?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-am-i-getting-a-cast-invalid-input-error-when-using-display/m-p/141142#M51630</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/199244"&gt;@SRJDB&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Could you run my_dataframe.printSchema() and attach the result here?&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 11:10:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-am-i-getting-a-cast-invalid-input-error-when-using-display/m-p/141142#M51630</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-12-04T11:10:12Z</dc:date>
    </item>
  </channel>
</rss>

