Databricks

Rahul_Samant · ‎01-19-2022

i need to convert a spark dataframe to pandas dataframe with arrow optimization

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

data_df=df.toPandas()

but getting one of the below error randomly while doing so

Exception: arrow is not supported when using file-based collect

OR

/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:

[Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'

Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Note: Using high concurrency pass through cluster with 10.0 ML runtime

another problem with Pass through Cluster is not able to load the registered model and make predicitons using spark but have to use pandas mode . getting below error while loading model using udf . is it a limitation of pass through high concurrency cluster as it works in standard cluster ?

predict = mlflow.pyfunc.spark_udf(spark, model_uri)

Exception

PermissionError: [Errno 13] Permission denied: '/databricks/driver'

Hubert-Dudek · ‎01-20-2022

You need to use pandas library written on top of spark dataframes. Please use for example:

~~from pandas import read_csv~~

from pyspark.pandas import read_csv

pdf = read_csv("data.csv")

more here on blog https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

View solution in original post

Anonymous · ‎01-19-2022

Hello @Rahul Samant - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thanks for asking!

Let's give the community a while to answer before we circle back around to this.

Hubert-Dudek · ‎01-20-2022

You need to use pandas library written on top of spark dataframes. Please use for example:

~~from pandas import read_csv~~

from pyspark.pandas import read_csv

pdf = read_csv("data.csv")

more here on blog https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

Rahul_Samant · ‎01-20-2022

Thanks HubertDudek.

I think using the new library has its own limitations for e.g

i tried doing the predictions based on pandas on spark but its giving error as below though it works fine on normal pandas df.

ValueError: Expected 2D array, got 1D array instead:

data_df=df.to_pandas_on_spark()

#procssed_df is generated after feature engineering on df

inputDf=processed_df.to_pandas_on_spark()

data_df['SCORE']=model.decision_function(inputDf.drop('TEST_VAR4',axis=1))

AlexanderBij · ‎08-09-2022

Can you confirm this is a known issue?

Running into same issue, example to test in 1 cell.

# using Arrow fails on HighConcurrency-cluster with PassThrough in runtime 10.4 (and 10.5 and 11.0)
 
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # toggle to see difference
df = spark.createDataFrame(sc.parallelize(range(0, 100)), schema="int")
df.toPandas()  # << error here
 
# Msg: arrow is not supported when using file-based collect

It does work on a Personal cluster (Standard / SingleNode) with PassthroughAuth.

Databricks

High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf

How to successfully build GenAI applications

Registration now open! Databricks Data + AI Summit 2024

Meet DBRX, the New Standard for High-Quality LLMs

Data Warehousing in the Era of AI