<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31377#M22836</link>
    <description>&lt;P&gt;Hello @Rahul Samant, my name is Piper, and I'm a moderator for Databricks. Welcome to the community and thanks for asking!&lt;/P&gt;&lt;P&gt;Let's give the community a while to answer before we circle back around to this.&lt;/P&gt;</description>
    <pubDate>Wed, 19 Jan 2022 16:34:09 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-01-19T16:34:09Z</dc:date>
    <item>
      <title>High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf</title>
      <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31376#M22835</link>
      <description>&lt;P&gt;I need to convert a Spark DataFrame to a pandas DataFrame with Arrow optimization:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;spark.conf.set("spark.sql.execution.arrow.enabled", "true")&lt;/P&gt;&lt;P&gt;data_df=df.toPandas()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, I randomly get one of the errors below when doing so:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Exception: arrow is not supported when using file-based collect&lt;/B&gt;&lt;/P&gt;&lt;P&gt;OR&lt;/P&gt;&lt;P&gt;&lt;B&gt;/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;  [Errno 13] Permission denied: '/local_disk0/spark-*/pyspark-*'&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Note: I am using a High Concurrency passthrough cluster with the 10.0 ML runtime.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Another problem with the passthrough cluster: I am not able to load the registered model and make predictions using Spark, so I have to use pandas mode instead. I get the error below while loading the model as a UDF. Is this a limitation of the passthrough High Concurrency cluster? It works on a Standard cluster.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;predict = mlflow.pyfunc.spark_udf(spark, model_uri)&lt;/P&gt;&lt;P&gt;&lt;B&gt;Exception&lt;/B&gt;&lt;/P&gt;&lt;P&gt;PermissionError: [Errno 13] Permission denied: '/databricks/driver'&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jan 2022 10:20:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31376#M22835</guid>
      <dc:creator>Rahul_Samant</dc:creator>
      <dc:date>2022-01-19T10:20:31Z</dc:date>
    </item>
    <item>
      <title>Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf</title>
      <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31377#M22836</link>
      <description>&lt;P&gt;Hello @Rahul Samant, my name is Piper, and I'm a moderator for Databricks. Welcome to the community and thanks for asking!&lt;/P&gt;&lt;P&gt;Let's give the community a while to answer before we circle back around to this.&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jan 2022 16:34:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31377#M22836</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-19T16:34:09Z</dc:date>
    </item>
    <item>
      <title>Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf</title>
      <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31378#M22837</link>
      <description>&lt;P&gt;You need to use the pandas API on Spark (pyspark.pandas), which is built on top of Spark DataFrames. For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;S&gt;from pandas import read_csv&lt;/S&gt;&lt;/P&gt;&lt;P&gt;from pyspark.pandas import read_csv&lt;/P&gt;&lt;P&gt;pdf = read_csv("data.csv")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;More details are on the blog: &lt;A href="https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html" target="test_blank"&gt;https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jan 2022 10:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31378#M22837</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-01-20T10:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf</title>
      <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31379#M22838</link>
      <description>&lt;P&gt;Thanks, HubertDudek.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think the new library has its own limitations. For example, I tried making predictions with pandas on Spark, but it gives the error below, although the same code works fine on a normal pandas DataFrame.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;ValueError: Expected 2D array, got 1D array instead:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data_df=df.to_pandas_on_spark()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;# processed_df is generated after feature engineering on df&lt;/P&gt;&lt;P&gt;inputDf=processed_df.to_pandas_on_spark()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data_df['SCORE']=model.decision_function(inputDf.drop('TEST_VAR4',axis=1))&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jan 2022 06:23:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31379#M22838</guid>
      <dc:creator>Rahul_Samant</dc:creator>
      <dc:date>2022-01-21T06:23:28Z</dc:date>
    </item>
    <item>
      <title>Re: High Concurrency Pass Through Cluster : pyarrow optimization not working while converting to pandasdf</title>
      <link>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31380#M22839</link>
      <description>&lt;P&gt;Can you confirm this is a known issue?&lt;/P&gt;&lt;P&gt;I am running into the same issue. Here is an example to test in one cell:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Using Arrow fails on a High Concurrency cluster with passthrough on runtime 10.4 (and 10.5 and 11.0)

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # toggle to see the difference
df = spark.createDataFrame(sc.parallelize(range(0, 100)), schema="int")
df.toPandas()  # &amp;lt;&amp;lt; error here

# Msg: arrow is not supported when using file-based collect&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It &lt;I&gt;does&lt;/I&gt; work on a personal cluster (Standard / Single Node) with passthrough authentication.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Aug 2022 12:42:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/high-concurrency-pass-through-cluster-pyarrow-optimization-not/m-p/31380#M22839</guid>
      <dc:creator>AlexanderBij</dc:creator>
      <dc:date>2022-08-09T12:42:26Z</dc:date>
    </item>
  </channel>
</rss>

