12-09-2024 02:26 PM
Hello community,
I installed the Databricks extension in my VS Code IDE. I created an environment to run my notebooks locally and selected the available remote cluster to execute my notebook. What else do I need to do to fix this error?
I have this error: ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf'
This is the snippet code:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
spark.sql("SELECT * FROM catalog.00_bronze_layer.client_email LIMIT 10")
12-12-2024 02:19 AM
Hi,
We encountered the same issue when importing sql from pyspark in the following code snippet.
from pyspark import sql

def get_spark_session() -> sql.SparkSession:
    spark = sql.SparkSession.getActiveSession()
    if not spark:
        # Try to get a Spark Connect session instead
        from databricks.connect import DatabricksSession
        from pyspark.errors.exceptions.connect import SparkConnectGrpcException
        spark = DatabricksSession.builder.getOrCreate()
    return spark
Error Encountered:
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf'
Environment Information:
python 3.11.10, pyspark 3.5.0, databricks-connect 15.4.4
FYI, occasionally deleting and reinstalling the virtual environment can fix the issue, but it's not a consistent solution.
12-09-2024 04:39 PM
Hi @jeremy98,
The error you are encountering, ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf', is most likely caused by a version mismatch between the pyspark library and the databricks-connect library: the AnalyzeArgument class does not exist in the pyspark version you are using. Could you please advise which versions of pyspark and databricks-connect you are using?
Can you try: pip install --upgrade databricks-connect
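For reference, one way to confirm which versions are actually installed in the active virtual environment is a small standard-library check (a sketch; installed_version is a hypothetical helper, not a Databricks API):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# The two libraries that must stay compatible with each other:
for pkg in ("pyspark", "databricks-connect"):
    print(pkg, "->", installed_version(pkg))
```

If the two versions have drifted apart, that is consistent with the import error reported in this thread.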
12-10-2024 01:37 AM - edited 12-10-2024 01:44 AM
Hello Alberto,
Thanks for your help. I have now upgraded databricks-connect to v16.0.0. I was also using pyspark, but how can I find its version? I only had:
pyspark -h
Python 3.13.0 (main, Oct 7 2024, 05:02:14) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
usage: pyspark [-h] [--remote REMOTE]
In Poetry it should be 3.5.0.
Now, I have another import error: ImportError: cannot import name 'is_remote_only' from 'pyspark.util'
12-10-2024 05:39 AM
Hi @jeremy98,
Can you also upgrade pyspark?
pip install --upgrade pyspark
Check whether the is_remote_only function exists in the version of PySpark you are using. You can do this by inspecting the pyspark.util module:
import pyspark.util
print(dir(pyspark.util))
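A guarded variant of the same check avoids scanning the full dir() listing and also handles the case where the module itself fails to import (a sketch; module_has_attr is a hypothetical helper):

```python
import importlib

def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True only if the module imports cleanly and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)

# With a matching pyspark/databricks-connect pair this is expected to be True:
print(module_has_attr("pyspark.util", "is_remote_only"))
```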
12-12-2024 06:36 PM
Hi @spiky001,
Could you please advise what the DBR version of your cluster is?
12-19-2024 08:55 AM
What version of pyspark is required? I did a clean install and got 3.5.3. I'm running Python 3.11.
$ pip freeze
cachetools==5.5.0
certifi==2024.12.14
charset-normalizer==3.4.0
databricks-connect==16.0.0
databricks-sdk==0.39.0
google-auth==2.37.0
googleapis-common-protos==1.66.0
grpcio==1.68.1
grpcio-status==1.68.1
idna==3.10
numpy==1.26.4
packaging==24.2
pandas==2.2.3
protobuf==5.29.2
py4j==0.10.9.7
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pyspark==3.5.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests==2.32.3
rsa==4.9
six==1.17.0
tzdata==2024.2
urllib3==2.2.3
I get the error just importing databricks.connect, so I don't see how a cluster property can matter.
$ python
Python 3.11.9 (main, Aug 13 2024, 12:21:18) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import databricks.connect
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/__init__.py", line 20, in <module>
from .session import DatabricksSession
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/session.py", line 28, in <module>
from .auth import DatabricksChannelBuilder
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/databricks/connect/auth.py", line 26, in <module>
from pyspark.sql.connect.client import ChannelBuilder
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/__init__.py", line 148, in <module>
from pyspark.sql import SQLContext, HiveContext, Row # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/__init__.py", line 43, in <module>
from pyspark.sql.context import SQLContext, HiveContext, UDFRegistration, UDTFRegistration
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/context.py", line 39, in <module>
from pyspark.sql.session import _monkey_patch_RDD, SparkSession
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/session.py", line 48, in <module>
from pyspark.sql.functions import lit
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/functions/__init__.py", line 20, in <module>
from pyspark.sql.functions.builtin import * # noqa: F401,F403
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/functions/builtin.py", line 50, in <module>
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf' (/home/jim/.pyenv/versions/3.11/lib/python3.11/site-packages/pyspark/sql/udtf.py)
12-19-2024 09:31 AM
I wonder if I need to install Java. I bet I do.
12-19-2024 03:02 PM
Right you are! I had actually installed pyspark myself, and that caused the error until I installed Java.
Sorry.
12-19-2024 11:42 AM
@unj1m Yes, as Alberto said, you don't need to install pyspark; it is included in your cluster configuration.
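Since databricks-connect ships its own PySpark build, a standalone pyspark wheel in the same environment is the usual culprit. A small check for that conflict (a sketch; find_conflict is a hypothetical helper, and the actual fix is pip uninstall pyspark followed by reinstalling databricks-connect):

```python
from importlib.metadata import distributions

def find_conflict():
    """Return ['pyspark'] if a standalone pyspark wheel coexists with
    databricks-connect (which bundles its own PySpark build), else []."""
    names = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:
            names.add(name.lower())
    if {"pyspark", "databricks-connect"} <= names:
        return ["pyspark"]
    return []

print(find_conflict())  # a non-empty result means: pip uninstall pyspark
```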