<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22622#M15521</link>
    <description>&lt;P&gt;Weirdly, `getBufferSizeFor` is the cause of the failure. IMO a method with that name shouldn't raise an out-of-bounds error.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 19 Apr 2022 19:21:39 GMT</pubDate>
    <dc:creator>ivanychev</dc:creator>
    <dc:date>2022-04-19T19:21:39Z</dc:date>
    <item>
      <title>toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22619#M15518</link>
      <description>&lt;P&gt;Using DBR 10.0&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When calling toPandas(), the worker fails with IndexOutOfBoundsException. It seems like ArrowWriter.sizeInBytes (which looks like a proprietary method, since I can't find it in OSS Spark) calls Arrow's getBufferSizeFor, which fails with this error. What is the root cause of this issue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's a sample of the full stack trace:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318)
at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305)
at org.apache.arrow.memory.ArrowBuf.getInt(ArrowBuf.java:424)
at org.apache.arrow.vector.complex.BaseRepeatedValueVector.getBufferSizeFor(BaseRepeatedValueVector.java:229)
at org.apache.arrow.vector.complex.ListVector.getBufferSizeFor(ListVector.java:621)
at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.getSizeInBytes(ArrowWriter.scala:165)
at org.apache.spark.sql.execution.arrow.ArrowWriter.sizeInBytes(ArrowWriter.scala:118)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:224)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1647)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:235)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:199)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2025 13:30:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22619#M15518</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2025-03-21T13:30:54Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22620#M15519</link>
      <description>&lt;P&gt;@Sergey Ivanychev&amp;nbsp;, I think it's trying to return too much data to pandas and overloading the memory. What are you trying to do? You shouldn't need to use pandas much anymore, since Spark 3.2 introduced the pandas API on Spark: &lt;A href="https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html" alt="https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html" target="_blank"&gt;https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 18:14:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22620#M15519</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-19T18:14:33Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22621#M15520</link>
      <description>&lt;P&gt;I'm feeding the DataFrame to an ML model. `toPandas()` works perfectly fine with `spark.sql.execution.arrow.pyspark.enabled` set to `false`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But disabling Arrow pipeline by pipeline is far from ideal. The error above doesn't explain much, and the failure occurs in proprietary code, so at this point I don't know where to look for the error.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:17:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22621#M15520</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-19T19:17:50Z</dc:date>
    </item>
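The workaround above (setting `spark.sql.execution.arrow.pyspark.enabled` to `false`) can be scoped to a single `toPandas()` call rather than a whole cluster. A minimal configuration sketch, assuming an existing SparkSession named `spark` and a DataFrame `df` (both hypothetical here):

```python
# Sketch: temporarily disable Arrow for one toPandas() call, then restore it.
# Assumes a live SparkSession `spark` and a DataFrame `df`.
previous = spark.conf.get("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
try:
    pdf = df.toPandas()  # falls back to the slower, non-Arrow collect path
finally:
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", previous)
```

This keeps Arrow enabled for every other pipeline while the problematic collect runs without it.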
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22622#M15521</link>
      <description>&lt;P&gt;Weirdly, `getBufferSizeFor` is the cause of the failure. IMO a method with that name shouldn't raise an out-of-bounds error.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:21:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22622#M15521</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-19T19:21:39Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22623#M15522</link>
      <description>&lt;P&gt;toPandas() is only for small datasets.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please use instead:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;to_pandas_on_spark()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It is essential to use pandas on Spark instead of ordinary pandas so that it works in a distributed way. Here is more info: &lt;A href="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So always import pandas as:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.pandas as ps&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:28:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22623#M15522</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-19T19:28:00Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22624#M15523</link>
      <description>&lt;P&gt;As I noted, `toPandas()` works great with `spark.sql.execution.arrow.pyspark.enabled` set to `false`. I understand that to_pandas_on_spark() is an option, but I need a Pandas DataFrame, not a pandas-on-Spark DataFrame.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:33:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22624#M15523</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-19T19:33:12Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22625#M15524</link>
      <description>&lt;P&gt;Turning Arrow off is going to increase your execution time. It might be better to use something like applyInPandas. You might also want to adjust the Arrow batch size: &lt;A href="https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size" target="_blank"&gt;https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:50:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22625#M15524</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-19T19:50:18Z</dc:date>
    </item>
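The batch-size knob mentioned in the reply is an ordinary Spark conf. A hedged sketch, assuming a live SparkSession named `spark` (10000 is the default documented for this setting):

```python
# Sketch: shrink Arrow record batches so each one stays well under the
# per-buffer limits. Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
```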
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22626#M15525</link>
      <description>&lt;P&gt;Again, I can't use `applyInPandas` because I need to collect data to feed into an ML model. I need a *Pandas DataFrame*.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have enough memory on my driver (turning off Arrow makes the code work).&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 19:57:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22626#M15525</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-19T19:57:49Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22627#M15526</link>
      <description>&lt;P&gt;applyInPandas takes a function argument, and that function can apply an ML model.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 20:23:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22627#M15526</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-19T20:23:19Z</dc:date>
    </item>
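For readers unfamiliar with `applyInPandas`: Spark hands each group to a Python function as a plain pandas DataFrame, so per-group code sees ordinary pandas objects. The per-group semantics resemble pandas' own groupby-apply, sketched here with plain pandas (hypothetical column names, no cluster needed):

```python
import pandas as pd

# Hypothetical data: two groups with a numeric feature.
df = pd.DataFrame({"group": ["a", "a", "b"], "x": [1.0, 3.0, 10.0]})

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # In Spark, applyInPandas would call a function with this shape once per
    # group; it could fit or score a model on pdf. Here we just take a mean.
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "x_mean": [pdf["x"].mean()]})

# Emulate the per-group dispatch with a plain pandas groupby.
result = pd.concat((per_group(g) for _, g in df.groupby("group")),
                   ignore_index=True)
```

The Spark version additionally requires an output schema and runs the function distributed across executors, but the function body is the same idea.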
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22628#M15527</link>
      <description>&lt;P&gt;We train an ML model, not apply it. We need to fetch a batch of data as a Pandas DataFrame and feed it into a model for training.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 20:29:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22628#M15527</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-19T20:29:49Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22629#M15528</link>
      <description>&lt;P&gt;Yes, ML model training is done with a function such as model.fit().&lt;/P&gt;</description>
      <pubDate>Tue, 19 Apr 2022 20:39:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22629#M15528</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-19T20:39:35Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22630#M15529</link>
      <description>&lt;P&gt;I know that. Is my question not clear?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Apr 2022 08:29:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22630#M15529</guid>
      <dc:creator>ivanychev</dc:creator>
      <dc:date>2022-04-20T08:29:32Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22633#M15532</link>
      <description>&lt;P&gt;I have a similar situation.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 12 Aug 2022 06:28:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22633#M15532</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-08-12T06:28:25Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22634#M15533</link>
      <description>&lt;P&gt;This could be an Arrow version mismatch. Did you by any chance install anything that could pull in a different Arrow version? It can happen indirectly via other libs.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2022 22:41:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22634#M15533</guid>
      <dc:creator>sean_owen</dc:creator>
      <dc:date>2022-09-19T22:41:56Z</dc:date>
    </item>
    <item>
      <title>Re: toPandas() causes IndexOutOfBoundsException in Apache Arrow</title>
      <link>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22635#M15534</link>
      <description>&lt;P&gt;I am also facing the same issue. I have set `spark.sql.execution.arrow.pyspark.enabled` to `false`, but I am still hitting it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any idea what's going on? Please help me out.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 4 times, most recent failure: Lost task 0.3 in stage 39.0 (TID 3789) (10.132.234.41 executor 39): java.lang.IndexOutOfBoundsException: index: 2147483640, length: 174 (expected: range(0, 2147483648))
	at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
	at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:890)
	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1087)
	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$ArrowWriterThread.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:110)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$ArrowWriterThread.writeIteratorToStream(ArrowPythonRunner.scala:132)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:521)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2241)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:313)
&amp;nbsp;
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2873)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2820)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2814)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2814)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1350)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1350)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1350)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3081)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3022)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3010)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 2147483640, length: 174 (expected: range(0, 2147483648))
	at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
	at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:890)
	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1087)
	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$ArrowWriterThread.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:110)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$ArrowWriterThread.writeIteratorToStream(ArrowPythonRunner.scala:132)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:521)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2241)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:313)
&amp;nbsp;
&amp;nbsp;
=== Streaming Query ===
Identifier: [id = 1f85f00f-6e6f-4b42-b178-0fe871f8ec02, runId = 46d257c6-3992-40bc-9353-7d8bb161925c]
Current Committed Offsets: {}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 11 Dec 2022 14:36:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/topandas-causes-indexoutofboundsexception-in-apache-arrow/m-p/22635#M15534</guid>
      <dc:creator>vikas_ahlawat</dc:creator>
      <dc:date>2022-12-11T14:36:50Z</dc:date>
    </item>
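The index in this last trace sits only a few bytes under 2^31, which is consistent with a single Arrow buffer for one string column hitting its 2 GiB capacity within one record batch; lowering `spark.sql.execution.arrow.maxRecordsPerBatch` is the usual mitigation. Note also that this trace goes through ArrowPythonRunner (the pandas UDF exchange), which, as far as I can tell, the `spark.sql.execution.arrow.pyspark.enabled` setting for toPandas() does not govern. A plain-Python sanity check of that reading, with an assumed average row size:

```python
# Back-of-the-envelope check of the failing write in the stack trace above:
# the offset plus the value length crosses the 2 GiB (2**31) capacity that
# the exception message reports ("expected: range(0, 2147483648)").
CAPACITY = 2**31
offset, length = 2147483640, 174  # values from the IndexOutOfBoundsException

overflows = offset + length > CAPACITY
assert overflows  # this is exactly the bounds check that ArrowBuf trips

# Rough sizing under an assumed average row payload: how many ~1 KiB string
# values fit in one batch before a single buffer would overflow. Keep
# maxRecordsPerBatch comfortably below this kind of figure.
avg_bytes_per_row = 1024  # assumption for illustration only
max_safe_rows = CAPACITY // avg_bytes_per_row
print(max_safe_rows)
```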
  </channel>
</rss>

