Handling Large Integers and None Values in pandas UDFs on Databricks

vineet_chaure — Tue, 11 Feb 2025 18:19:46 GMT

Hi Everyone,

I hope this message finds you well.

I am encountering an issue with pandas UDFs on a Databricks shared cluster and would like to seek assistance from the community. Below is a summary of the problem:

Description:

I am working with pandas UDFs to process a column of large integers while preserving the LongType data type and correctly handling None values.

Problem:

When I create a pandas UDF to process a column with large integers and None values, the data is being converted to float64 before entering the UDF, leading to precision loss. Here is a simplified version of my code:

import pandas as pd
import numpy as np
from pyspark.sql.types import LongType, StructField, StructType
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, col

schema = StructType([
    StructField("col1", LongType(), True)
])

with open("/temp/data.csv", "w") as file:
    file.write("1234567890111213141\nNone")

# Read from the file
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)
display(df)

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    return col

result_df = df.withColumn("col1", process_data(col("col1")))
display(result_df)

OUTPUT:

col1
1234567890111213141
null

col1
1234567890111213056
null

I updated the code to ensure the column is an integer with a nullable type:

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    # Ensure the column is a nullable type
    # Replace None with np.nan
    col = col.replace({None: np.nan})
    # Convert to integer, preserving NaNs
    return col.astype('Int64')

However, I still get the same output. I then updated the code to check what input the pandas UDF is receiving:

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    raise ValueError(col)
    return col

OUTPUT:

ValueError: 0    1.234568e+18
1             NaN
Name: _0, dtype: float64

It seems that when the data is received by the pandas UDF, it is received as float64, which is causing data corruption.
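The exact value in the second output can be reproduced without Spark. The snippet below is a minimal plain-Python sketch of the arithmetic, assuming the column really is carried as float64 as the ValueError output suggests; the variable names are illustrative only.

```python
# float64 has a 53-bit significand, so not every integer above 2**53
# can be represented exactly.
original = 1234567890111213141       # value written to the CSV in the post
as_float = float(original)           # what a float64 Series would hold
round_tripped = int(as_float)        # what ends up back in the LongType column

# The value needs 61 bits, so neighbouring float64 values in this range
# are 2**(61 - 53) = 256 apart and the input is rounded to the nearest one.
spacing = 2 ** (original.bit_length() - 53)

print(round_tripped)                 # 1234567890111213056
print(original - round_tripped)      # 85
print(spacing)                       # 256
```

The rounded result matches the corrupted value shown in the output above, which is consistent with the loss happening in the int64-to-float64 conversion rather than inside the UDF body.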

Issue:

Despite declaring the column as LongType in the schema, the data is still received as float64 within the UDF, causing precision loss for large integers.

Question:

How can I ensure that the data remains in LongType (or Int64 in pandas) throughout the processing in the pandas UDF, while correctly handling None values?

Any insights or suggestions would be greatly appreciated!

Thank you for your assistance.

Best regards,
Vineet Kumar Chaure

Re: Handling Large Integers and None Values in pandas UDFs on Databricks

Alberto_Umana — Tue, 11 Feb 2025 20:33:12 GMT

Hello @vineet_chaure,

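By default, Spark converts a LongType column that contains nulls to float64 when transferring the data to pandas, because a NumPy int64 array has no way to represent missing values. You can try the Arrow-optimized UDF support introduced in Apache Spark 3.5.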

Please try the code below:

import pandas as pd
import pyarrow as pa
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType, StructField, StructType

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Define the schema
schema = StructType([StructField("col1", LongType(), True)])

# Read the data
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)

# Define the Arrow-optimized pandas UDF
# Note: useArrow is documented for pyspark.sql.functions.udf in Spark 3.5;
# if pandas_udf rejects it in your runtime, drop the argument.
@pandas_udf(LongType(), useArrow=True)
def process_data(col: pd.Series) -> pd.Series:
    # Convert to nullable integer type
    return col.astype(pd.Int64Dtype())

# Apply the UDF
result_df = df.withColumn("col1", process_data(col("col1")))

# Show the result
result_df.show()
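If the UDF still receives float64, one fallback that is sometimes used is to hand the column over as a string and parse it inside the function, since a string column reaches pandas with None preserved instead of NaN. The helper below is only a plain-Python sketch of that per-value logic. parse_long is a hypothetical name, and wiring it into Spark (for example by casting the column to string before the UDF) is an assumption that has not been verified on a shared cluster.

```python
from typing import Optional


def parse_long(text: Optional[str]) -> Optional[int]:
    """Parse a decimal string into an int, passing None through unchanged."""
    if text is None:
        return None
    # Python ints are arbitrary precision, so no rounding happens here.
    return int(text)


# Hypothetical sample mirroring the two rows from the question.
values = ["1234567890111213141", None]
parsed = [parse_long(v) for v in values]
print(parsed)  # [1234567890111213141, None]
```

The design choice is simply to keep the value out of any floating-point representation on the way in; the null row stays None throughout.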
