toPandas() causes IndexOutOfBoundsException in Apache Arrow

ivanychev
Contributor

Using DBR 10.0

When calling toPandas(), the worker fails with an IndexOutOfBoundsException. It seems that ArrowWriter.sizeInBytes (which looks like a proprietary method, since I can't find it in the OSS sources) calls Arrow's getBufferSizeFor, which fails with this error. What is the root cause of this issue?

Here's a sample of the full stack trace:

java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318)
at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305)
at org.apache.arrow.memory.ArrowBuf.getInt(ArrowBuf.java:424)
at org.apache.arrow.vector.complex.BaseRepeatedValueVector.getBufferSizeFor(BaseRepeatedValueVector.java:229)
at org.apache.arrow.vector.complex.ListVector.getBufferSizeFor(ListVector.java:621)
at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.getSizeInBytes(ArrowWriter.scala:165)
at org.apache.spark.sql.execution.arrow.ArrowWriter.sizeInBytes(ArrowWriter.scala:118)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:224)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1647)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:235)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:199)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)

16 REPLIES

Anonymous
Not applicable

@Sergey Ivanychev​ , I think it's trying to return too much data to pandas and overloading the memory. What are you trying to do? You shouldn't need plain pandas much anymore since the introduction of the pandas API on Spark in 3.2: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

I'm feeding the DataFrame to the ML model. The `toPandas()` works perfectly fine with `spark.sql.execution.arrow.pyspark.enabled` set to `false`.

But disabling Arrow pipeline by pipeline is far from ideal. The error above doesn't explain much, and the failure occurs in proprietary code, so at this point I don't know where to look.

Weirdly, `getBufferSizeFor` is the cause of the failure. In my opinion, a method with such a name shouldn't throw an out-of-bounds error.
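A hedged sketch of the workaround discussed above: fall back to the non-Arrow path only when the Arrow collect actually fails, and restore the original setting afterwards. The helper and its name are illustrative, not part of any API; it assumes you pass in the active SparkSession.

```python
def to_pandas_with_arrow_fallback(spark, df):
    """Collect `df` to a pandas DataFrame via Arrow; if the Arrow path
    raises (e.g. the IndexOutOfBoundsException above), retry once with
    Arrow disabled, then restore the original configuration value."""
    key = "spark.sql.execution.arrow.pyspark.enabled"
    original = spark.conf.get(key, "true")
    try:
        return df.toPandas()
    except Exception:
        spark.conf.set(key, "false")
        try:
            return df.toPandas()
        finally:
            spark.conf.set(key, original)
```

This keeps the fast Arrow path for healthy pipelines and only pays the slow non-Arrow serialization cost when the bug is hit.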

Hubert-Dudek
Esteemed Contributor III

toPandas() is only for small datasets.

Please use instead:

to_pandas_on_spark()

It is essential to use pandas on Spark instead of ordinary pandas so that the work is done in a distributed way. Here is more info: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

So always import Pandas as:

import pyspark.pandas as ps
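A minimal sketch of the suggestion above (the wrapper function and its name are illustrative; `to_pandas_on_spark()` is the Spark 3.2 API available on DBR 10.x):

```python
def as_pandas_on_spark(df):
    """Convert a Spark DataFrame to a pandas-on-Spark DataFrame, so that
    pandas-style operations remain distributed across the cluster instead
    of being collected to the driver. Requires Spark >= 3.2."""
    # On Spark 3.2 this is DataFrame.to_pandas_on_spark(); later Spark
    # versions also offer DataFrame.pandas_api() as the preferred name.
    return df.to_pandas_on_spark()
```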

As I noted, `toPandas()` works great with `spark.sql.execution.arrow.pyspark.enabled` set to `false`. I understand that to_pandas_on_spark() is an option, but I need a pandas DataFrame, not a pandas-on-Spark DataFrame.

Anonymous
Not applicable

Turning Arrow off is going to increase your execution time. It might be better to use something like applyInPandas. You might also want to adjust the Arrow batch size: https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size
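A hedged sketch of the batch-size tuning mentioned above. The configuration key `spark.sql.execution.arrow.maxRecordsPerBatch` is the documented Spark setting (default 10000 rows per Arrow record batch); the helper name and the value 2000 are illustrative.

```python
def shrink_arrow_batches(spark, records_per_batch=2000):
    """Lower the number of rows per Arrow record batch before calling
    toPandas(). Smaller batches mean smaller per-batch buffers, which
    can sidestep buffer-bound problems at the cost of more batches."""
    spark.conf.set(
        "spark.sql.execution.arrow.maxRecordsPerBatch",
        str(records_per_batch),
    )
```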

Again, I can't use `applyInPandas` because I need to collect the data to feed into an ML model. I need a *pandas DataFrame*.

I have enough memory on my driver (turning off Arrow makes the code work).

Anonymous
Not applicable

applyinpandas takes a function argument, which can be an ML model.
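For reference, a hedged sketch of what that suggestion looks like. `applyInPandas` is the documented PySpark grouped-map API; the `fit_partition` helper, the column names, and the schema string below are illustrative, and the toy body just returns the group size where a real `model.fit()` call would go.

```python
import pandas as pd

def fit_partition(pdf):
    """Receives one group as a plain pandas DataFrame. A model could be
    fit on `pdf` here and summary rows returned; this toy version just
    reports the number of rows in the group."""
    return pd.DataFrame({"n_rows": [len(pdf)]})

# Usage on a Spark DataFrame `df` (not runnable outside a Spark session):
# result = df.groupBy("group_col").applyInPandas(fit_partition,
#                                                schema="n_rows long")
```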

We train an ML model, not apply it. We need to fetch a batch of data as a pandas DataFrame and feed it into a model for training.

Anonymous
Not applicable

Yes, the ML model training is done with a function such as model.fit().

I know that. Is my question not clear?

Anonymous
Not applicable

I have a similar situation.

Kaniz
Community Manager

Hi @Sergey Ivanychev​ , just a friendly follow-up. Do you still need help, or did @Hubert Dudek (Customer)​'s and @Joseph Kambourakis​'s responses help you find the solution? Please let us know.

Kaniz
Community Manager

Hi @Sergey Ivanychev​ , we haven't heard from you since my last response, and I was checking back to see if you found a solution. If you have one, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
