01-19-2022 02:20 AM
I need to convert a Spark DataFrame to a pandas DataFrame with Arrow optimization:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # deprecated alias of "spark.sql.execution.arrow.pyspark.enabled"
data_df = df.toPandas()
but I randomly get one of the errors below while doing so:
Exception: arrow is not supported when using file-based collect
OR
/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
[Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
Note: using a High Concurrency passthrough cluster with the 10.0 ML runtime.
Another problem with the passthrough cluster: I am not able to load a registered model and make predictions using Spark, and have to use pandas mode instead. I get the error below while loading the model as a UDF. Is this a limitation of the High Concurrency passthrough cluster? It works on a standard cluster.
predict = mlflow.pyfunc.spark_udf(spark, model_uri)
Exception:
PermissionError: [Errno 13] Permission denied: '/databricks/driver'
01-20-2022 02:46 AM
You can use the pandas API written on top of Spark DataFrames. For example, instead of:
from pandas import read_csv
use:
from pyspark.pandas import read_csv
pdf = read_csv("data.csv")
More on the blog: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
01-19-2022 08:34 AM
Hello @Rahul Samant - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thanks for asking!
Let's give the community a while to answer before we circle back around to this.
01-20-2022 10:23 PM
Thanks, HubertDudek.
I think the new library has its own limitations. For example, I tried doing the predictions with pandas on Spark, but it gives the error below, although it works fine on a normal pandas DataFrame:
ValueError: Expected 2D array, got 1D array instead:
data_df = df.to_pandas_on_spark()
# processed_df is generated after feature engineering on df
inputDf = processed_df.to_pandas_on_spark()
data_df['SCORE'] = model.decision_function(inputDf.drop('TEST_VAR4', axis=1))
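For what it's worth, that `ValueError` is scikit-learn's standard complaint when an estimator receives a 1D array; it expects features shaped `(n_samples, n_features)`, and suggests `reshape(-1, 1)` for single-feature data. A numpy-only sketch of the shape issue, with hypothetical data rather than the actual model above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # 1D, shape (3,) -- the shape sklearn rejects
X = x.reshape(-1, 1)            # 2D, shape (3, 1) -- one feature per row
print(x.shape, X.shape)         # (3,) (3, 1)
```

Note also that scikit-learn operates on local data, so a pandas-on-Spark frame would typically need to be brought back to plain pandas (e.g. via its `to_pandas()` method) before being passed to `decision_function` -- an assumption here, not something confirmed in the thread.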
08-09-2022 05:42 AM
Can you confirm this is a known issue?
I am running into the same issue; here is an example to test in one cell:
# using Arrow fails on HighConcurrency-cluster with PassThrough in runtime 10.4 (and 10.5 and 11.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") # toggle to see difference
df = spark.createDataFrame(sc.parallelize(range(0, 100)), schema="int")
df.toPandas() # << error here
# Msg: arrow is not supported when using file-based collect
It does work on a Personal cluster (Standard / SingleNode) with PassthroughAuth.
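As a stopgap, and assuming the fallback behaviour mentioned earlier in the thread, Arrow can be toggled off so `toPandas()` takes the slower non-Arrow path. A config sketch:

```python
# Config fragment: disable Arrow-based collection for this session.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
df.toPandas()  # non-Arrow path; slower, but avoids the file-based collect error
```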