<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15639#M9947</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have multiple datasets in my data lake that use valid_from and valid_to columns to indicate the validity period of each row.&lt;/P&gt;&lt;P&gt;If a row is currently valid, this is indicated by valid_to=9999-12-31 00:00:00.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Example_SCD2"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2437i31E88243BA48E3C0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Example_SCD2" alt="Example_SCD2" /&gt;&lt;/span&gt;Loading this into a Spark dataframe works fine (Spark has no issue with the timestamp 9999-12-31).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, for analysis and visualization purposes, I would like to do further processing with Pandas instead of Spark. But when I try to convert the dataframe to Pandas, an error occurs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;ArrowInvalid: Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp: 253379592300000000&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Code for reproducing the issue:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import datetime
import pandas as pd
df_spark_native = sc.parallelize([
    [1,   'Alice',   datetime.date(1985, 4, 13),   datetime.datetime(1985, 4, 13, 4,5)],
    [2,   'Bob',     datetime.date(9999, 1, 20),   datetime.datetime(9999, 4, 13, 4,5)],
    [3,   'Eve',     datetime.date(1500, 1, 20),   datetime.datetime(1500, 4, 13, 4,5)],
    [4,   'Dave',    datetime.date(   1, 1, 20),   datetime.datetime(   1, 4, 13, 4,5)]
]).toDF(('ID', 'Some_Text', 'Some_Date', 'Some_Timestamp'))
display( df_spark_native )
df_spark_native.printSchema()
df_spark_to_pandas = df_spark_native.toPandas()
display( df_spark_to_pandas )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It appears to me that, under the hood, Spark uses PyArrow to convert the dataframe to Pandas.&lt;/P&gt;&lt;P&gt;PyArrow already has functionality for handling dates and timestamps that would otherwise cause out-of-range issues: the parameters "&lt;B&gt;timestamp_as_object&lt;/B&gt;" and "&lt;B&gt;date_as_object&lt;/B&gt;" of &lt;A href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas" alt="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas" target="_blank"&gt;&lt;B&gt;pyarrow.Table.to_pandas()&lt;/B&gt;&lt;/A&gt;. However, Spark's toPandas() currently does not allow passing parameters down to PyArrow.&lt;/P&gt;</description>
    <pubDate>Sat, 11 Sep 2021 10:34:17 GMT</pubDate>
    <dc:creator>MartinB</dc:creator>
    <dc:date>2021-09-11T10:34:17Z</dc:date>
    <item>
      <title>Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15639#M9947</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have multiple datasets in my data lake that use valid_from and valid_to columns to indicate the validity period of each row.&lt;/P&gt;&lt;P&gt;If a row is currently valid, this is indicated by valid_to=9999-12-31 00:00:00.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Example_SCD2"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2437i31E88243BA48E3C0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Example_SCD2" alt="Example_SCD2" /&gt;&lt;/span&gt;Loading this into a Spark dataframe works fine (Spark has no issue with the timestamp 9999-12-31).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, for analysis and visualization purposes, I would like to do further processing with Pandas instead of Spark. But when I try to convert the dataframe to Pandas, an error occurs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;ArrowInvalid: Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp: 253379592300000000&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Code for reproducing the issue:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import datetime
import pandas as pd
df_spark_native = sc.parallelize([
    [1,   'Alice',   datetime.date(1985, 4, 13),   datetime.datetime(1985, 4, 13, 4,5)],
    [2,   'Bob',     datetime.date(9999, 1, 20),   datetime.datetime(9999, 4, 13, 4,5)],
    [3,   'Eve',     datetime.date(1500, 1, 20),   datetime.datetime(1500, 4, 13, 4,5)],
    [4,   'Dave',    datetime.date(   1, 1, 20),   datetime.datetime(   1, 4, 13, 4,5)]
]).toDF(('ID', 'Some_Text', 'Some_Date', 'Some_Timestamp'))
display( df_spark_native )
df_spark_native.printSchema()
df_spark_to_pandas = df_spark_native.toPandas()
display( df_spark_to_pandas )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It appears to me that, under the hood, Spark uses PyArrow to convert the dataframe to Pandas.&lt;/P&gt;&lt;P&gt;PyArrow already has functionality for handling dates and timestamps that would otherwise cause out-of-range issues: the parameters "&lt;B&gt;timestamp_as_object&lt;/B&gt;" and "&lt;B&gt;date_as_object&lt;/B&gt;" of &lt;A href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas" alt="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas" target="_blank"&gt;&lt;B&gt;pyarrow.Table.to_pandas()&lt;/B&gt;&lt;/A&gt;. However, Spark's toPandas() currently does not allow passing parameters down to PyArrow.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Sep 2021 10:34:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15639#M9947</guid>
      <dc:creator>MartinB</dc:creator>
      <dc:date>2021-09-11T10:34:17Z</dc:date>
    </item>
    <item>
      <title>Re: Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15640#M9948</link>
      <description>&lt;P&gt;Hello @Martin B. It's nice to meet you. I'm Piper, one of the community moderators here. Thank you for your question, and I'm sorry to hear about the issue. If no one comments soon, please be patient. The team will be back on Monday.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Sep 2021 20:15:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15640#M9948</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-09-11T20:15:46Z</dc:date>
    </item>
    <item>
      <title>Re: Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15641#M9949</link>
      <description>&lt;P&gt;Hi @Piper Wilson, can the team help?&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 15:58:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15641#M9949</guid>
      <dc:creator>MartinB</dc:creator>
      <dc:date>2021-09-28T15:58:10Z</dc:date>
    </item>
    <item>
      <title>Re: Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15642#M9950</link>
      <description>&lt;P&gt;@Martin B. - I apologize for my delayed response. I've pinged the team again. Thanks for your patience.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Sep 2021 15:46:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15642#M9950</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-09-29T15:46:56Z</dc:date>
    </item>
    <item>
      <title>Re: Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15643#M9951</link>
      <description>&lt;P&gt;Currently, out-of-bounds timestamps are not supported in PyArrow/pandas: the nanosecond-precision timestamp type that pandas uses covers only roughly the years 1677 through 2262. Please refer to the associated JIRA issues below.&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17104355" target="test_blank"&gt;https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17104355&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/ARROW-8967" target="test_blank"&gt;https://issues.apache.org/jira/browse/ARROW-8967&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Oct 2021 14:42:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/15643#M9951</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2021-10-06T14:42:15Z</dc:date>
    </item>
    <item>
      <title>Re: Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPand</title>
      <link>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/108218#M42994</link>
      <description>&lt;P&gt;Be aware that in Databricks 15.2 LTS this behavior is broken.&lt;BR /&gt;I cannot find the code, but it is most likely related to the following option:&lt;BR /&gt;&lt;A href="https://github.com/apache/spark/commit/c1c710e7da75b989f4d14e84e85f336bc10920e0#diff-f9ddcc6cba651c6ebfd34e29ef049c38950cff45ba9f1e461cda315de9b0a56cR149" target="_blank"&gt;https://github.com/apache/spark/commit/c1c710e7da75b989f4d14e84e85f336bc10920e0#diff-f9ddcc6cba651c6ebfd34e29ef049c38950cff45ba9f1e461cda315de9b0a56cR149&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I was able to reproduce the issue locally with the latest pyarrow installed and this option enabled.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jan 2025 22:26:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/interoperability-spark-pandas-can-t-convert-spark-dataframe-to/m-p/108218#M42994</guid>
      <dc:creator>ThePhil</dc:creator>
      <dc:date>2025-01-31T22:26:53Z</dc:date>
    </item>
  </channel>
</rss>

