High Concurrency Pass Through Cluster: pyarrow optimization not working while converting to pandas DataFrame

Rahul_Samant · ‎01-19-2022

I need to convert a Spark DataFrame to a pandas DataFrame with Arrow optimization enabled:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

data_df = df.toPandas()

However, I randomly get one of the following errors when doing so:

Exception: arrow is not supported when using file-based collect

OR

/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:

[Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'

Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Note: I am using a High Concurrency pass-through cluster with the 10.0 ML runtime.
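For reference, here is a minimal sketch of how the conversion could be wrapped so that a failed Arrow attempt is retried explicitly without Arrow. The helper name to_pandas_with_fallback is hypothetical, and the config key is the one named in the warning above. This only makes the fallback visible; it is not claimed to fix the underlying permission problem on a pass-through cluster.

```python
def to_pandas_with_fallback(spark, df):
    """Try an Arrow-optimized toPandas(); retry without Arrow if it raises.

    `spark` is the active SparkSession and `df` a Spark DataFrame.
    Returns (pandas_df, used_arrow). Hypothetical helper for illustration.
    """
    key = "spark.sql.execution.arrow.pyspark.enabled"
    spark.conf.set(key, "true")
    try:
        return df.toPandas(), True
    except Exception as exc:
        # Make the fallback explicit instead of relying on the warning text.
        print(f"Arrow conversion failed ({exc}); retrying without Arrow")
        spark.conf.set(key, "false")
        return df.toPandas(), False
```

It would be called as data_df, used_arrow = to_pandas_with_fallback(spark, df), which also records whether the Arrow path actually succeeded on a given run.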

Another problem with the pass-through cluster: I am not able to load the registered model and make predictions using Spark, and have to use pandas mode instead. I get the error below when loading the model as a UDF. Is this a limitation of High Concurrency pass-through clusters? It works on a Standard cluster.

predict = mlflow.pyfunc.spark_udf(spark, model_uri)

Exception

PermissionError: [Errno 13] Permission denied: '/databricks/driver'
