<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Handling Large Integers and None Values in pandas UDFs on Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handling-large-integers-and-none-values-in-pandas-udfs-on/m-p/109870#M43416</link>
    <description>&lt;P&gt;Hi Everyone,&lt;/P&gt;&lt;P&gt;I hope this message finds you well.&lt;/P&gt;&lt;P&gt;I am encountering an issue with pandas UDFs on a Databricks shared cluster and would like to seek assistance from the community. Below is a summary of the problem:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Description:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I am working with pandas UDFs to process a column of large integers while preserving the LongType data type and correctly handling None values.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When I create a pandas UDF to process a column with large integers and None values, the data is converted to float64 before entering the UDF, leading to precision loss. Here is a simplified version of my code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd
import numpy as np
from pyspark.sql.types import LongType, StructField, StructType
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, col

schema = StructType([
    StructField("col1", LongType(), True)
])

with open("/temp/data.csv", "w") as file:
    file.write("1234567890111213141\nNone")

# Read from the file
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)
display(df)

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -&amp;gt; pd.Series:
    return col

result_df = df.withColumn("col1", process_data(col("col1")))
display(result_df)&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;OUTPUT:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;col1
1234567890111213141
null

col1
1234567890111213056
null&lt;/LI-CODE&gt;&lt;P&gt;I updated the code to ensure the column is an integer with a nullable type:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -&amp;gt; pd.Series:
    # Ensure the column is a nullable type
    # Replace None with np.nan
    col = col.replace({None: np.nan})
    # Convert to integer, preserving NaNs
    return col.astype('Int64')&lt;/LI-CODE&gt;&lt;P&gt;However, I still get the same output. I then updated the code to check what input the pandas UDF is receiving:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -&amp;gt; pd.Series:
    raise ValueError(col)
    return col&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;OUTPUT:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;ValueError: 0    1.234568e+18
1             NaN
Name: _0, dtype: float64&lt;/PRE&gt;&lt;P&gt;It seems the pandas UDF receives the column as float64, which corrupts the large values.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Issue:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Although the column is declared as LongType in the schema, the data still arrives as float64 inside the UDF, causing precision loss for large integers.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Question:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;How can I ensure that the data remains in LongType (or Int64 in pandas) throughout the processing in the pandas UDF, while correctly handling None values?&lt;/P&gt;&lt;P&gt;Any insights or suggestions would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thank you for your assistance.&lt;/P&gt;&lt;P&gt;Best regards,&lt;BR /&gt;Vineet Kumar Chaure&lt;/P&gt;</description>
    <pubDate>Tue, 11 Feb 2025 18:19:46 GMT</pubDate>
    <dc:creator>vineet_chaure</dc:creator>
    <dc:date>2025-02-11T18:19:46Z</dc:date>
    <item>
      <title>Handling Large Integers and None Values in pandas UDFs on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-large-integers-and-none-values-in-pandas-udfs-on/m-p/109870#M43416</link>
      <pubDate>Tue, 11 Feb 2025 18:19:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-large-integers-and-none-values-in-pandas-udfs-on/m-p/109870#M43416</guid>
      <dc:creator>vineet_chaure</dc:creator>
      <dc:date>2025-02-11T18:19:46Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Large Integers and None Values in pandas UDFs on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-large-integers-and-none-values-in-pandas-udfs-on/m-p/109899#M43424</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/148044"&gt;@vineet_chaure&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;When a LongType column contains nulls, Spark hands it to a pandas UDF as float64, because the Arrow-to-pandas conversion cannot store missing values in a NumPy int64 array. You can try the Arrow-optimized Python UDFs introduced in Apache Spark 3.5.&lt;/P&gt;
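&lt;P&gt;To make the float64 problem concrete, here is a minimal pandas-only sketch (no Spark involved). It assumes the UDF input behaves like an ordinary float64 Series, as the ValueError output in the question suggests, and compares it with the nullable Int64 dtype. The names BIG, as_float and as_nullable are illustrative only.&lt;/P&gt;

```python
import pandas as pd

# Value taken from the question; everything else here is illustrative.
BIG = 1234567890111213141

# float64 has a 53-bit significand, so it cannot hold every integer
# of this magnitude; the missing value is stored as NaN.
as_float = pd.Series([BIG, None], dtype="float64")
round_tripped = int(as_float.iloc[0])

# The nullable Int64 extension dtype keeps int64 values together with
# a separate missing-value mask, so no float conversion is needed.
as_nullable = pd.Series([BIG, None], dtype="Int64")
exact = int(as_nullable.iloc[0])

print(as_float.dtype, round_tripped)
print(as_nullable.dtype, exact, as_nullable.isna().tolist())
```

&lt;P&gt;The value printed for the float64 Series should match the corrupted 1234567890111213056 from the question, since float64 values near 1.2e18 are spaced 256 apart. Once the data has passed through float64, casting back to Int64 inside the UDF cannot recover the lost digits.&lt;/P&gt;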
&lt;P&gt;Please try the code below:&lt;/P&gt;
&lt;P&gt;from pyspark.sql.functions import col, udf&lt;BR /&gt;from pyspark.sql.types import LongType, StructField, StructType&lt;/P&gt;
&lt;P&gt;# Enable Arrow-based columnar data transfers&lt;BR /&gt;spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")&lt;/P&gt;
&lt;P&gt;# Define the schema&lt;BR /&gt;schema = StructType([StructField("col1", LongType(), True)])&lt;/P&gt;
&lt;P&gt;# Read the data&lt;BR /&gt;df = spark.read.csv("/temp/data.csv", header=False, schema=schema)&lt;/P&gt;
&lt;P&gt;# Define the Arrow-optimized Python UDF (Spark 3.5+).&lt;BR /&gt;# useArrow is a parameter of udf; pandas_udf does not accept it.&lt;BR /&gt;@udf(returnType=LongType(), useArrow=True)&lt;BR /&gt;def process_data(value):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Called once per value rather than once per pandas Series&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return value&lt;/P&gt;
&lt;P&gt;# Apply the UDF&lt;BR /&gt;result_df = df.withColumn("col1", process_data(col("col1")))&lt;/P&gt;
&lt;P&gt;# Show the result and check that the large value comes back unchanged&lt;BR /&gt;result_df.show()&lt;/P&gt;</description>
      <pubDate>Tue, 11 Feb 2025 20:33:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-large-integers-and-none-values-in-pandas-udfs-on/m-p/109899#M43424</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-11T20:33:12Z</dc:date>
    </item>
  </channel>
</rss>

