Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
High Concurrency Passthrough Cluster: PyArrow optimization not working while converting to a pandas DataFrame

Rahul_Samant
Contributor

I need to convert a Spark DataFrame to a pandas DataFrame with Arrow optimization:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

data_df = df.toPandas()

but I randomly get one of the errors below while doing so:

Exception: arrow is not supported when using file-based collect

OR

/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:

[Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'

Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Note: I am using a high-concurrency passthrough cluster with the 10.0 ML runtime.
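For context, the two messages above point at related settings: one says Arrow collect is unsupported on this cluster type, the other shows the silent fallback kicking in. A minimal config sketch of the knobs involved, assuming the Spark 3.x / DBR 10.x property names (the older `spark.sql.execution.arrow.enabled` used in the question is a deprecated alias):

```python
# Arrow-backed conversion for df.toPandas(); preferred Spark 3.x property name
# (the older "spark.sql.execution.arrow.enabled" is a deprecated alias):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Let Spark fall back to the slower non-Arrow path instead of failing outright:
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

# If Arrow collect keeps failing on this cluster type, disabling Arrow avoids the
# intermittent "arrow is not supported when using file-based collect" error:
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```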

Another problem with the passthrough cluster is that I am not able to load a registered model and make predictions using Spark; I have to use pandas mode instead. I get the error below while loading the model as a UDF. Is this a limitation of high-concurrency passthrough clusters? It works on a standard cluster.

predict = mlflow.pyfunc.spark_udf(spark, model_uri)

Exception

PermissionError: [Errno 13] Permission denied: '/databricks/driver'

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

You need to use the pandas library written on top of Spark DataFrames. For example, instead of:

from pandas import read_csv

use:

from pyspark.pandas import read_csv

pdf = read_csv("data.csv")

More details on the Databricks blog: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
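To sketch why the import swap is the whole migration: pandas-on-Spark mirrors the pandas API, so code written against plain pandas stays largely unchanged. The snippet below uses plain pandas with a made-up file and columns so it is self-contained; on Databricks you would only change the import to `from pyspark.pandas import read_csv` as the answer suggests:

```python
import os
import tempfile

from pandas import read_csv  # on Databricks: from pyspark.pandas import read_csv

# Write a tiny CSV so the example is self-contained (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("id,score\n1,10\n2,20\n3,30\n")

pdf = read_csv(path)        # same call in both libraries
total = pdf["score"].sum()  # same pandas-style operations afterwards
print(total)                # 60
```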


4 REPLIES

Anonymous
Not applicable

Hello @Rahul Samant - My name is Piper, and I'm a moderator for Databricks. Welcome to the community, and thanks for asking!

Let's give the community a while to answer before we circle back around to this.


Thanks @Hubert-Dudek.

I think the new library has its own limitations, though. For example, I tried doing the predictions with pandas-on-Spark, but it gives the error below, although it works fine on a normal pandas DataFrame:

ValueError: Expected 2D array, got 1D array instead:

data_df = df.to_pandas_on_spark()

# processed_df is generated after feature engineering on df
inputDf = processed_df.to_pandas_on_spark()

data_df['SCORE'] = model.decision_function(inputDf.drop('TEST_VAR4', axis=1))
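For reference, scikit-learn raises that ValueError whenever the input it receives collapses to a 1-D array; the usual fix is to materialize a proper 2-D numpy array before calling the model (e.g. via `.to_numpy()` on the frame, or `reshape` on a single feature). A minimal numpy sketch of the shape issue, with made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # 1-D: shape (3,) -> sklearn rejects this
X = x.reshape(-1, 1)           # 2-D: shape (3, 1) -> one feature column
print(x.shape, X.shape)        # (3,) (3, 1)
```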

AlexanderBij
New Contributor II

Can you confirm this is a known issue?

I am running into the same issue; here is an example you can test in one cell:

# using Arrow fails on HighConcurrency-cluster with PassThrough in runtime 10.4 (and 10.5 and 11.0)
 
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # toggle to see difference
df = spark.createDataFrame(sc.parallelize(range(0, 100)), schema="int")
df.toPandas()  # << error here
 
# Msg: arrow is not supported when using file-based collect

It does work on a Personal cluster (Standard / Single Node) with passthrough auth.
