09-11-2021 03:34 AM
Hi,
I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity period of each row.
If a row is currently valid, this is indicated by valid_to = 9999-12-31 00:00:00.
Example:
Loading this into a Spark dataframe works fine (Spark has no issue with timestamp 9999-12-31).
However, for analysis and visualization purposes, I would like to do further processing with pandas instead of Spark. But when trying to convert the DataFrame to pandas, an error occurs:
ArrowInvalid: Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp: 253379592300000000
Code for simulating the issue:
import datetime
import pandas as pd

# sc and display() are provided by the Databricks notebook environment.
df_spark_native = sc.parallelize([
    [1, 'Alice', datetime.date(1985, 4, 13), datetime.datetime(1985, 4, 13, 4, 5)],
    [2, 'Bob',   datetime.date(9999, 1, 20), datetime.datetime(9999, 4, 13, 4, 5)],  # far-future sentinel
    [3, 'Eve',   datetime.date(1500, 1, 20), datetime.datetime(1500, 4, 13, 4, 5)],
    [4, 'Dave',  datetime.date(   1, 1, 20), datetime.datetime(   1, 4, 13, 4, 5)]
]).toDF(('ID', 'Some_Text', 'Some_Date', 'Some_Timestamp'))

display(df_spark_native)
df_spark_native.printSchema()

# The conversion to pandas fails with ArrowInvalid for the out-of-range timestamps.
df_spark_to_pandas = df_spark_native.toPandas()
display(df_spark_to_pandas)
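The root cause is that pandas represents timestamps as 64-bit nanosecond values, so only dates between roughly 1677 and 2262 fit into datetime64[ns]. A quick check (my own illustration, not from the original post):

import pandas as pd

# datetime64[ns] is a signed 64-bit nanosecond offset from 1970-01-01,
# which bounds the representable range:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

The rows with years 9999, 1500, and 1 all fall outside this range, which is exactly what the ArrowInvalid error reports.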
To me it appears that, under the hood, Spark uses PyArrow to convert the DataFrame to pandas.
PyArrow already offers functionality for handling dates and timestamps that would otherwise cause out-of-range issues: the timestamp_as_object and date_as_object parameters of pyarrow.Table.to_pandas(). However, Spark's toPandas() currently does not allow passing these parameters down to PyArrow.
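As a possible workaround (my own sketch, not something the thread confirms; it relies on PySpark's private _collect_as_arrow() helper, an internal API that may change between versions), the Arrow batches can be collected manually so those flags can be passed explicitly:

import pyarrow as pa

# Collect the Spark DataFrame as Arrow record batches via the internal helper,
# then convert with the flags that keep out-of-range values as Python objects.
batches = df_spark_native._collect_as_arrow()
arrow_table = pa.Table.from_batches(batches)
df_pandas = arrow_table.to_pandas(timestamp_as_object=True, date_as_object=True)
display(df_pandas)

The affected columns then come back as object columns holding plain datetime.datetime / datetime.date values, so vectorized datetime64 operations are not available on them, but the conversion no longer raises ArrowInvalid.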
09-11-2021 01:15 PM
Hello @Martin B., it's nice to meet you. I'm Piper, one of the community moderators here. Thank you for your question, and I'm sorry to hear about the issue. If no one comments soon, please be patient; the team will be back on Monday.
09-28-2021 08:58 AM
Hi @Piper Wilson , can the team help?
09-29-2021 08:46 AM
@Martin B. - I apologize for my delayed response. I've pinged the team again. Thanks for your patience.
10-06-2021 07:42 AM
Currently, out-of-bounds timestamps are not supported in PyArrow/pandas. Please refer to the associated JIRA issue below.
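Until that changes, one practical option (my suggestion, not part of the original answer) is to neutralize the sentinel on the Spark side before calling toPandas(), assuming a DataFrame df with the valid_to column described in the question:

from pyspark.sql import functions as F

# Hypothetical df: the data-lake table whose valid_to column uses the
# 9999-12-31 00:00:00 sentinel for "currently valid" rows.
sentinel = F.lit('9999-12-31 00:00:00').cast('timestamp')
df_fixed = df.withColumn(
    'valid_to',
    F.when(F.col('valid_to') == sentinel, F.lit(None).cast('timestamp'))
     .otherwise(F.col('valid_to'))
)
pdf = df_fixed.toPandas()  # no out-of-range values remain, so the Arrow cast succeeds

In pandas, the currently valid rows then show NaT in valid_to; alternatively, the sentinel can be clamped to pd.Timestamp.max if a concrete end date is required.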