Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Interoperability Spark ↔ Pandas: can't convert a Spark dataframe to a Pandas dataframe via df.toPandas() when it contains a datetime value in the distant future

MartinB
Contributor III

Hi,

I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity period of each row.

If a row is currently valid, this is indicated by valid_to = 9999-12-31 00:00:00.

Example: (screenshot "Example_SCD2" showing an SCD2 table)

Loading this into a Spark dataframe works fine (Spark has no issue with the timestamp 9999-12-31).

However, for analysis and visualization purposes, I would like to do further processing with Pandas instead of Spark. But when trying to convert the dataframe to Pandas, an error occurs:

ArrowInvalid: Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp: 253379592300000000
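A minimal pure-pandas sketch (no Spark needed) of why the cast fails: pandas stores timestamps as nanoseconds since the epoch in a signed 64-bit integer, so the representable range ends in the year 2262.

import pandas as pd

# The window a nanosecond-resolution int64 epoch value can hold:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# Any date outside this window (9999-12-31, year 1500, year 1)
# cannot be stored in a datetime64[ns] column, which is what
# produces the ArrowInvalid error above.

Rows with dates between those two bounds convert without any issue; only the sentinel and historic dates are affected.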

Code for simulating the issue:

import datetime

# Rows 2-4 contain dates/timestamps outside pandas' datetime64[ns] range
df_spark_native = sc.parallelize([
    [1, 'Alice', datetime.date(1985, 4, 13), datetime.datetime(1985, 4, 13, 4, 5)],
    [2, 'Bob',   datetime.date(9999, 1, 20), datetime.datetime(9999, 4, 13, 4, 5)],
    [3, 'Eve',   datetime.date(1500, 1, 20), datetime.datetime(1500, 4, 13, 4, 5)],
    [4, 'Dave',  datetime.date(   1, 1, 20), datetime.datetime(   1, 4, 13, 4, 5)],
]).toDF(('ID', 'Some_Text', 'Some_Date', 'Some_Timestamp'))
display(df_spark_native)
df_spark_native.printSchema()

# Fails with ArrowInvalid because of the out-of-range timestamps
df_spark_to_pandas = df_spark_native.toPandas()
display(df_spark_to_pandas)

It appears that, under the hood, Spark uses PyArrow to convert the dataframe to pandas.

PyArrow already has functionality for handling dates and timestamps that would otherwise go out of range: the "timestamp_as_object" and "date_as_object" parameters of pyarrow.Table.to_pandas(). However, Spark's toPandas() currently does not allow passing parameters down to PyArrow.
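A sketch of what that PyArrow parameter does, using a hand-built Arrow table (column name is illustrative). With timestamp_as_object=True, the column stays as plain datetime.datetime objects instead of being cast to datetime64[ns]:

import datetime
import pyarrow as pa

table = pa.table({
    "valid_to": pa.array(
        [datetime.datetime(9999, 12, 31), datetime.datetime(2020, 1, 1)],
        type=pa.timestamp("us"),
    )
})

# On the pandas/pyarrow versions in the post, a plain to_pandas()
# raises ArrowInvalid for the 9999-12-31 value; this flag avoids the cast.
df = table.to_pandas(timestamp_as_object=True)
print(df["valid_to"].dtype)  # object

The trade-off is that object-dtype columns lose vectorized datetime operations, but the values survive the round trip intact.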

1 ACCEPTED SOLUTION

Accepted Solutions

shan_chandra
Esteemed Contributor

Currently, out-of-bounds timestamps are not supported in PyArrow/pandas. Please refer to the associated JIRA issues below.

https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&page=com.atlassian.jira.p...

https://issues.apache.org/jira/browse/ARROW-8967

View solution in original post
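Until that is supported, a common workaround is to replace the 9999-12-31 "still valid" sentinel before calling toPandas() (in Spark this would be a when/otherwise expression on the column). A pure-pandas sketch of the clamping idea, with illustrative data:

import datetime
import pandas as pd

# Anything past this bound cannot live in a datetime64[ns] column
PANDAS_MAX = pd.Timestamp.max.to_pydatetime()

def clamp_sentinel(ts):
    # Map timestamps past the datetime64[ns] range
    # (e.g. the 9999-12-31 sentinel) to NaT
    return pd.NaT if ts > PANDAS_MAX else ts

rows = [datetime.datetime(1985, 4, 13, 4, 5),
        datetime.datetime(9999, 12, 31)]
s = pd.Series([clamp_sentinel(t) for t in rows])
print(s.dtype)

NaT then naturally signals "open-ended validity" on the pandas side, and the column keeps a proper datetime dtype for analysis and plotting.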

4 REPLIES

Anonymous
Not applicable

Hello @Martin B. It's nice to meet you. I'm Piper, one of the community moderators here. Thank you for your question, and I'm sorry to hear about the issue. If no one comments soon, please be patient; the team will be back on Monday.

MartinB
Contributor III

Hi @Piper Wilson​ , can the team help?

Anonymous
Not applicable

@Martin B.​ - I apologize for my delayed response. I've pinged the team again. Thanks for your patience.

