12-09-2024 02:26 PM
Hello community,
I installed the Databricks extension in my VS Code IDE. I created an environment to run my notebooks locally and selected the available remote cluster to execute my notebook. What else do I need to do to fix this error?
I have this error: ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf'
This is the snippet code:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
spark.sql("SELECT * FROM catalog.00_bronze_layer.client_email LIMIT 10")
12-12-2024 02:19 AM
Hi,
We encountered the same issue when importing sql from pyspark in the following code snippet.
from pyspark import sql

def get_spark_session() -> sql.SparkSession:
    spark = sql.SparkSession.getActiveSession()
    if not spark:
        # Try to get a Spark Connect session instead
        from databricks.connect import DatabricksSession
        from pyspark.errors.exceptions.connect import SparkConnectGrpcException
        spark = DatabricksSession.builder.getOrCreate()
    return spark
Error Encountered:
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf'
Environment Information:
python 3.11.10, pyspark 3.5.0, databricks-connect 15.4.4
FYI, occasionally deleting and reinstalling the virtual environment can fix the issue, but it's not a consistent solution.
12-09-2024 04:39 PM
Hi @jeremy98,
The error you are encountering, ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf', is most likely caused by a version mismatch between the pyspark library and the databricks-connect library: the AnalyzeArgument class does not exist in the pyspark version you are using. Could you please advise which versions of pyspark and databricks-connect you are using?
Can you try: pip install --upgrade databricks-connect
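For reference, one way to confirm which versions are actually installed in the active virtual environment is a small standard-library check (a sketch; installed_version is a hypothetical helper, not a Databricks API):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# The two libraries that must stay compatible with each other:
for pkg in ("pyspark", "databricks-connect"):
    print(pkg, "->", installed_version(pkg))
```

If the two versions have drifted apart, that is consistent with the import error reported in this thread.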
12-10-2024 01:37 AM - edited 12-10-2024 01:44 AM
Hello Alberto,
Thanks for your help. I have now upgraded databricks-connect to v16.0.0. I was also using pyspark, but how can I find its version? I only had:
pyspark -h
Python 3.13.0 (main, Oct 7 2024, 05:02:14) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
usage: pyspark [-h] [--remote REMOTE]
In Poetry it should be 3.5.0.
Now, I have another import error: ImportError: cannot import name 'is_remote_only' from 'pyspark.util'
12-10-2024 05:39 AM
Hi @jeremy98,
Can you also upgrade pyspark?
pip install --upgrade pyspark
Check whether the is_remote_only function exists in the version of PySpark you are using. You can do this by inspecting the pyspark.util module:
import pyspark.util
print(dir(pyspark.util))
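A guarded variant of the same check avoids scanning the full dir() listing and also handles the case where the module itself fails to import (a sketch; module_has_attr is a hypothetical helper):

```python
import importlib

def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True only if the module imports cleanly and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)

# With a matching pyspark/databricks-connect pair this is expected to be True:
print(module_has_attr("pyspark.util", "is_remote_only"))
```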
12-12-2024 06:36 PM
Hi @spiky001,
Could you please advise what the DBR version of your cluster is?
12-19-2024 08:55 AM
What version of pyspark is required? I did a clean install and got 3.5.3. I'm running Python 3.11.
$ pip freeze
cachetools==5.5.0
certifi==2024.12.14
charset-normalizer==3.4.0
databricks-connect==16.0.0
databricks-sdk==0.39.0
google-auth==2.37.0
googleapis-common-protos==1.66.0
grpcio==1.68.1
grpcio-status==1.68.1
idna==3.10
numpy==1.26.4
packaging==24.2
pandas==2.2.3
protobuf==5.29.2
py4j==0.10.9.7
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pyspark==3.5.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests==2.32.3
rsa==4.9
six==1.17.0
tzdata==2024.2
urllib3==2.2.3
I get the error just importing databricks.connect, so I don't see how a cluster property can matter.
$ python
Python 3.11.9 (main, Aug 13 2024, 12:21:18) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import databricks.connect
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/__init__.py", line 20, in <module>
from .session import DatabricksSession
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/session.py", line 28, in <module>
from .auth import DatabricksChannelBuilder
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/auth.py", line 26, in <module>
from pyspark.sql.connect.client import ChannelBuilder
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/__init__.py", line 148, in <module>
from pyspark.sql import SQLContext, HiveContext, Row # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/__init__.py", line 43, in <module>
from pyspark.sql.context import SQLContext, HiveContext, UDFRegistration, UDTFRegistration
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/context.py", line 39, in <module>
from pyspark.sql.session import _monkey_patch_RDD, SparkSession
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/session.py", line 48, in <module>
from pyspark.sql.functions import lit
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/functions/__init__.py", line 20, in <module>
from pyspark.sql.functions.builtin import * # noqa: F401,F403
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/functions/builtin.py", line 50, in <module>
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf' (/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/udtf.py)
12-19-2024 09:31 AM
I wonder if I need to install Java. I bet I do.
12-19-2024 03:02 PM
Right you are! I had actually installed pyspark myself, and that caused the error until I installed Java.
Sorry.
12-19-2024 11:42 AM
@unj1m Yes, as Alberto said, you don't need to install pyspark; it is included in your cluster configuration.
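Since databricks-connect ships its own PySpark build, a standalone pyspark wheel in the same environment is the usual culprit. A small check for that conflict (a sketch; find_conflict is a hypothetical helper, and the actual fix is pip uninstall pyspark followed by reinstalling databricks-connect):

```python
from importlib.metadata import distributions

def find_conflict():
    """Return ['pyspark'] if a standalone pyspark wheel coexists with
    databricks-connect (which bundles its own PySpark build), else []."""
    names = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:
            names.add(name.lower())
    if {"pyspark", "databricks-connect"} <= names:
        return ["pyspark"]
    return []

print(find_conflict())  # a non-empty result means: pip uninstall pyspark
```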