High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

Rahul_Samant — Wed, 19 Jan 2022 10:20:31 GMT

I need to convert a Spark DataFrame to a pandas DataFrame with Arrow optimization:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

data_df=df.toPandas()

But I randomly get one of the errors below when doing so:

Exception: arrow is not supported when using file-based collect

/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:

[Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'

Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Note: I am using a High Concurrency passthrough cluster with the 10.0 ML runtime.

Another problem with the passthrough cluster: I am not able to load the registered model and make predictions using Spark, and have to use pandas mode instead. I get the error below while loading the model as a UDF. Is this a limitation of High Concurrency passthrough clusters? It works on a Standard cluster.

predict = mlflow.pyfunc.spark_udf(spark, model_uri)

Exception

PermissionError: [Errno 13] Permission denied: '/databricks/driver'

Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

Anonymous — Wed, 19 Jan 2022 16:34:09 GMT

Hello @Rahul Samant - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thanks for asking!

Let's give the community a while to answer before we circle back around to this.

Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

Hubert-Dudek — Thu, 20 Jan 2022 10:46:16 GMT

You need to use the pandas API on Spark, which is built on top of Spark DataFrames. For example, replace the plain pandas import with the pyspark.pandas one:

# instead of: from pandas import read_csv

from pyspark.pandas import read_csv

pdf = read_csv("data.csv")

More in this blog post: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

Rahul_Samant — Fri, 21 Jan 2022 06:23:28 GMT

Thanks HubertDudek.

I think the new library has its own limitations. For example, I tried running the predictions on a pandas-on-Spark DataFrame, but it raises the error below, although the same code works fine on a regular pandas DataFrame.

ValueError: Expected 2D array, got 1D array instead:

data_df=df.to_pandas_on_spark()

# processed_df is generated after feature engineering on df

inputDf=processed_df.to_pandas_on_spark()

data_df['SCORE']=model.decision_function(inputDf.drop('TEST_VAR4',axis=1))
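A likely reason is that the model here expects an in-memory 2D array or pandas DataFrame, while a pandas-on-Spark DataFrame is distributed. A minimal sketch of one way around it, assuming the feature table is small enough to collect to the driver: convert with to_pandas() before calling the model. The helper name score_on_driver and its drop_cols parameter are made up for illustration; to_pandas() and drop(columns=...) are the standard pandas-on-Spark and pandas calls.

```python
def score_on_driver(model, features, drop_cols=()):
    """Score a pandas-on-Spark DataFrame with an in-memory model.

    Hypothetical helper, not part of any library. `features` is assumed to be
    a pandas-on-Spark DataFrame and `model` any object that exposes
    decision_function(), as in the snippet above.
    """
    # to_pandas() collects every row to the driver, so this only suits
    # data that fits in driver memory.
    local = features.to_pandas()
    if drop_cols:
        # Drop non-feature columns on the local pandas DataFrame.
        local = local.drop(columns=list(drop_cols))
    # The model now receives a regular pandas DataFrame.
    return model.decision_function(local)
```

With the names from the snippet above this would be called as score_on_driver(model, inputDf, ["TEST_VAR4"]). Note that it brings the data back to plain pandas, so it does not avoid the collect step that the Arrow setting was meant to speed up.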

Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

AlexanderBij — Tue, 09 Aug 2022 12:42:26 GMT

Can you confirm this is a known issue?

I am running into the same issue. Here is an example that can be tested in one cell:

# using Arrow fails on HighConcurrency-cluster with PassThrough in runtime 10.4 (and 10.5 and 11.0)
 
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # toggle to see difference
df = spark.createDataFrame(sc.parallelize(range(0, 100)), schema="int")
df.toPandas()  # << error here
 
# Msg: arrow is not supported when using file-based collect

It does work on a Personal cluster (Standard / SingleNode) with PassthroughAuth.
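Until this is confirmed, a possible workaround based on the toggle above is to try the Arrow path first and retry without Arrow when this specific error appears. This is only a sketch: the helper name to_pandas_with_fallback is made up, and matching on the error text is an assumption taken from the message reported in this thread, not a documented contract.

```python
ARROW_KEY = "spark.sql.execution.arrow.pyspark.enabled"


def to_pandas_with_fallback(spark, df):
    """Convert a Spark DataFrame to pandas, retrying without Arrow if needed.

    Hypothetical helper. `spark` is the active SparkSession and `df` a Spark
    DataFrame. The previous value of the Arrow setting is restored afterwards.
    """
    previous = spark.conf.get(ARROW_KEY, "false")
    spark.conf.set(ARROW_KEY, "true")
    try:
        return df.toPandas()
    except Exception as exc:
        # Assumption: the failure is identified by the message seen in this
        # thread. Any other error is re-raised unchanged.
        if "arrow is not supported" not in str(exc):
            raise
        spark.conf.set(ARROW_KEY, "false")
        return df.toPandas()  # slower, non-Arrow conversion
    finally:
        # Leave the session configuration as it was found.
        spark.conf.set(ARROW_KEY, previous)
```

Usage would be pdf = to_pandas_with_fallback(spark, df). The retry gives up the Arrow speed-up on this cluster type, so it addresses the exception but not the underlying limitation.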