Re: Handling Large Integers and None Values in pan...

Alberto_Umana · ‎02-11-2025

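When a LongType column contains nulls, Spark hands it to pandas as float64, because NumPy's int64 has no way to represent a missing value. Inside a pandas UDF you can cast the column back to the nullable Int64 dtype. Pandas UDFs always exchange data with Spark through Arrow; Apache Spark 3.5 additionally introduced Arrow-optimized Python UDFs via udf(..., useArrow=True). Keep in mind that the float64 conversion happens before the UDF runs, so values above 2^53 can already be rounded by the time the function sees them.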
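To see why the float64 conversion matters for large integers, here is a minimal pandas-only sketch (no Spark needed). The sample value 2**53 + 1 is my own illustration, not taken from the original data.

```python
import pandas as pd

# 2**53 + 1 is the first positive integer that float64 cannot represent exactly.
big = 9007199254740993

# float64 column: None becomes NaN and the large value is rounded.
as_float = pd.Series([big, None], dtype="float64")

# Nullable Int64 column: the integer stays exact and None becomes <NA>.
as_int64 = pd.Series([big, None], dtype="Int64")

print(as_float.dtype, int(as_float[0]))
print(as_int64.dtype, int(as_int64[0]))
```

The float64 series no longer holds the original value, while the Int64 series does, which is why a cast after the fact cannot recover digits that were already rounded away.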

Please try the code below:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType, StructField, StructType

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Define the schema
schema = StructType([StructField("col1", LongType(), True)])

# Read the data
df = spark.read.csv("/temp/data.csv", header=False, schema=schema)

# Define the pandas UDF (pandas_udf always uses Arrow and takes no useArrow argument)
@pandas_udf(LongType())
def process_data(values: pd.Series) -> pd.Series:
    # Convert to the nullable integer type so missing values stay <NA>
    return values.astype(pd.Int64Dtype())

# Apply the UDF
result_df = df.withColumn("col1", process_data(col("col1")))

# Show the result
result_df.show()