12-19-2022 09:01 AM
Hi everyone,
I want to do some tests regarding data quality and for that I pretend to use PyDeequ on a databricks notebook. Keep in mind that I'm very new to databricks and Spark.
First I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added to the environment variable
SPARK_VERSION=3.2
as referred in the repository's GitHub.
Since the available PyPI package is not up to date I tried installing the package through a notebook-scoped library with the following comand
%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git
(The first line is only to prevent a conflict on the numpy versions.)
Then, when doing
import pydeequ
I get
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
19 from pydeequ.analyzers import AnalysisRunner
20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
22 from pydeequ.profiles import ColumnProfilerRunner
23
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
35
36
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
26
27 def _get_deequ_maven_config():
---> 28 spark_version = _get_spark_version()
29 try:
30 return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
21 ]
22 output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23 spark_version = output.stdout.decode().split("\n")[-2]
24 return spark_version
25
IndexError: list index out of range
Can you please help me finding the reason to this or an alternative to get the library without the PyPI.
Thanks in advance!
12-20-2022 10:07 AM
I assumed I wouldn't need to add the Deequ library. Apparently, all I had to do was add it via Maven coordinates and it solved the problem.
12-19-2022 06:07 PM
yes this is legit i am also facing the same, I am working on it will update you soon
12-19-2022 06:30 PM
hey @Humberto Santos I got this answer
it is happening because the Numpy version is not compatible with your pydeequ
see it is working
numpy==1.20.1 is compatible with this package
Please like this or upvote this answer,you can select this as a best answer also
Thanks
Aviral Bhardwaj
12-20-2022 10:11 AM
That was not the problem. I hadn't installed the Deequ library from Maven
12-20-2022 10:07 AM
I assumed I wouldn't need to add the Deequ library. Apparently, all I had to do was add it via Maven coordinates and it solved the problem.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group