12-19-2022 09:01 AM
Hi everyone,
I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.
First I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable
SPARK_VERSION=3.2
as described in the project's GitHub repository.
Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands
%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git
(The first line is only to prevent a conflict on the numpy versions.)
Then, when doing
import pydeequ
I get
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
19 from pydeequ.analyzers import AnalysisRunner
20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
22 from pydeequ.profiles import ColumnProfilerRunner
23
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
35
36
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
26
27 def _get_deequ_maven_config():
---> 28 spark_version = _get_spark_version()
29 try:
30 return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
21 ]
22 output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23 spark_version = output.stdout.decode().split("\n")[-2]
24 return spark_version
25
IndexError: list index out of range
Can you please help me find the reason for this, or suggest an alternative way to get the library without using PyPI?
Thanks in advance!
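The last frame of the traceback indexes `[-2]` on the newline-split output of a subprocess. That raises `IndexError` whenever the output contains no newline, for example when the command prints nothing to stdout. Here is a minimal sketch of that failure mode in isolation. The helper names are hypothetical and not part of PyDeequ, and the actual subprocess command is not visible in the traceback.

```python
from typing import Optional


def last_output_line(stdout: str) -> str:
    """Use the same indexing as the failing line in the traceback."""
    return stdout.split("\n")[-2]


def safe_last_output_line(stdout: str) -> Optional[str]:
    """Hypothetical defensive variant: return None instead of raising."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    return lines[-1] if lines else None


# Newline-terminated output works as intended.
print(last_output_line("3.2.1\n"))

# Empty output reproduces the error from the traceback.
try:
    last_output_line("")
except IndexError as exc:
    print("IndexError:", exc)

# The defensive variant makes the "no output" case explicit.
print(safe_last_output_line(""))
```

This only shows why the indexing fails on empty output. It does not show why the version command printed nothing on this cluster.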
Accepted Solutions
12-20-2022 10:07 AM
I assumed I wouldn't need to add the Deequ library. Apparently, all I had to do was add it via Maven coordinates and it solved the problem.
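For reference, the `configs.py` frames in the traceback show PyDeequ looking up a Deequ Maven coordinate from the first three characters of the Spark version. A rough sketch of that lookup follows. The function name and the single mapping entry are assumptions for illustration, so check the coordinate against the PyDeequ repository and Maven Central before attaching it to the cluster as a Maven library.

```python
# Assumed mapping for illustration only; verify the coordinate before use.
ASSUMED_SPARK_TO_DEEQU_COORD = {
    "3.2": "com.amazon.deequ:deequ:2.0.1-spark-3.2",
}


def deequ_coord_for(spark_version: str) -> str:
    """Return the assumed Deequ Maven coordinate for a Spark version string."""
    key = spark_version[:3]  # "3.2.1" -> "3.2", as in the traceback's lookup
    try:
        return ASSUMED_SPARK_TO_DEEQU_COORD[key]
    except KeyError:
        raise RuntimeError(
            f"No Deequ coordinate known for Spark {spark_version!r}"
        ) from None


# Databricks Runtime 10.4 LTS ships Apache Spark 3.2.1.
print(deequ_coord_for("3.2.1"))
```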
12-19-2022 06:07 PM
Yes, this is a real issue; I am facing the same error. I am working on it and will update you soon.
12-19-2022 06:30 PM
Hey @Humberto Santos, I found an answer.
It is happening because the NumPy version is not compatible with your pydeequ installation.
With the following version it works for me:
numpy==1.20.1 is compatible with this package.
Please like or upvote this answer; you can also select it as the best answer.
Thanks,
Aviral Bhardwaj
12-20-2022 10:11 AM
That was not the problem. I hadn't installed the Deequ library from Maven.