
Handling Large Integers and None Values in pandas UDFs on Databricks

vineet_chaure
New Contributor

Hi Everyone,

I hope this message finds you well.

I am encountering an issue with pandas UDFs on a Databricks shared cluster and would like to seek assistance from the community. Below is a summary of the problem:

Description:

I am working with pandas UDFs to process a column of large integers while preserving the LongType data type and correctly handling None values.

Problem:

When I create a pandas UDF to process a column with large integers and None values, the data is being converted to float64 before entering the UDF, leading to precision loss. Here is a simplified version of my code:

import pandas as pd
import numpy as np
from pyspark.sql.types import LongType, StructField, StructType
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, col

schema = StructType([
    StructField("col1", LongType(), True)
])

with open("/temp/data.csv", "w") as file:
    file.write("1234567890111213141\nNone")

# Read from the file
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)
display(df)

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    return col

result_df = df.withColumn("col1", process_data(col("col1")))
display(result_df)

OUTPUT:

display(df):
col1
1234567890111213141
null

display(result_df):
col1
1234567890111213056
null

I updated the code to ensure the column is an integer with a nullable type:

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    # Ensure the column is a nullable type
    # Replace None with np.nan
    col = col.replace({None: np.nan})
    # Convert to integer, preserving NaNs
    return col.astype('Int64')

However, I still get the same output. I then updated the code to check what input the pandas UDF is receiving:

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def process_data(col: pd.Series) -> pd.Series:
    raise ValueError(col)
    return col

OUTPUT:

ValueError: 0    1.234568e+18
1             NaN
Name: _0, dtype: float64

It seems that the data arrives in the pandas UDF as float64, which is what corrupts the large integer values.
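For reference, the same corruption shows up in plain Python, outside Spark, because float64 has only a 53-bit significand and cannot represent this integer exactly:

x = 1234567890111213141

# Integers above 2**53 are not all exactly representable as float64,
# so the round trip through float silently changes the value.
print(x > 2**53)        # True
print(int(float(x)))    # 1234567890111213056 -- the same corrupted value as above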

Issue:

Despite the column being defined as LongType in the schema, the data is received as float64 inside the UDF, causing precision loss for large integers.

Question:

How can I ensure that the data remains in LongType (or Int64 in pandas) throughout the processing in the pandas UDF, while correctly handling None values?

Any insights or suggestions would be greatly appreciated!

Thank you for your assistance.

Best regards,
Vineet Kumar Chaure

1 REPLY

Alberto_Umana
Databricks Employee

Hello @vineet_chaure,

By default, a LongType column that contains nulls is converted to float64 when the data is transferred to pandas, because NumPy's int64 dtype cannot represent null values; that conversion is what loses precision for very large integers. You can use the Arrow-optimized pandas UDF support introduced in Apache Spark 3.5.

Please try the code below:

import pandas as pd
import pyarrow as pa
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import LongType, StructField, StructType

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Define the schema
schema = StructType([StructField("col1", LongType(), True)])

# Read the data
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)

# Define the Arrow-optimized pandas UDF
@pandas_udf(LongType(), useArrow=True)
def process_data(col: pd.Series) -> pd.Series:
    # Convert to pandas' nullable Int64 type so nulls are kept without a float cast
    return col.astype(pd.Int64Dtype())

# Apply the UDF
result_df = df.withColumn("col1", process_data(col("col1")))

# Show the result
result_df.show()
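Whichever decorator you end up using, you can reuse the probing trick from your question to confirm what dtype actually reaches the UDF. A minimal sketch (the probe function name is just for illustration):

# Raising with the incoming Series surfaces its dtype and values in the error
# message when the action runs; the same body can be dropped into the decorator used above.
@pandas_udf(LongType())
def probe(s: pd.Series) -> pd.Series:
    raise ValueError(f"dtype inside UDF: {s.dtype}\n{s}")

# df.withColumn("col1", probe(col("col1"))).collect()  # then inspect the raised message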
