
Error when importing PyDeequ package

hf_santos
New Contributor III

Hi everyone,

I want to do some tests regarding data quality, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.

First I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable

SPARK_VERSION=3.2

as described in the repository's README on GitHub.

Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands:

%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git

(The first line is only there to prevent a conflict between numpy versions.)

Then, when doing

import pydeequ

I get

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
 
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
     19 from pydeequ.analyzers import AnalysisRunner
     20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
     22 from pydeequ.profiles import ColumnProfilerRunner
     23 
 
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
     35 
     36 
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
     38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
     26 
     27 def _get_deequ_maven_config():
---> 28     spark_version = _get_spark_version()
     29     try:
     30         return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
     21     ]
     22     output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23     spark_version = output.stdout.decode().split("\n")[-2]
     24     return spark_version
     25 
 
IndexError: list index out of range
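
If I read the traceback correctly, pydeequ/configs.py runs an external command (it looks like spark-submit --version, although the command itself is cut off in the frame above) and then takes the second-to-last line of its stdout. My guess is that on the Databricks driver this command prints nothing to stdout, which would reproduce exactly this error. A minimal sketch of that assumption, using only the parsing logic visible in the traceback:

# Assumption: the subprocess printed nothing to stdout on the Databricks driver.
stdout = b""
lines = stdout.decode().split("\n")   # -> [''], a single-element list
spark_version = lines[-2]             # IndexError: list index out of range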

Can you please help me find the reason for this, or suggest an alternative way to get the library without going through PyPI?

Thanks in advance!


4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

Yes, this is a legitimate issue; I am also facing the same thing. I am working on it and will update you soon.

Aviral-Bhardwaj
Esteemed Contributor III

Hey @Humberto Santos, I got this answer:

It is happening because the numpy version is not compatible with your pydeequ.

See, it is working:

[screenshot attached in the original post]

numpy==1.20.1 is compatible with this package
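
In practice that just means pinning the older numpy before installing the package, i.e. the same commands as in the original post with the version changed (assuming nothing else in your environment needs the newer numpy):

%pip install numpy==1.20.1
%pip install git+https://github.com/awslabs/python-deequ.git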

Please like or upvote this answer; you can also select it as the best answer.

Thanks

Aviral Bhardwaj

That was not the problem. I hadn't installed the Deequ library from Maven.

hf_santos
New Contributor III

I assumed I wouldn't need to add the Deequ library itself. Apparently all I had to do was add it to the cluster via its Maven coordinates, and that solved the problem.
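
In case it helps anyone else, roughly what this looks like: attach Deequ to the cluster as a Maven library (Compute > your cluster > Libraries > Install new > Maven), using the coordinate that matches the cluster's Spark version. For Spark 3.2 that should be something like com.amazon.deequ:deequ:2.0.1-spark-3.2, but please double-check the exact version on Maven Central. With the jar attached (and SPARK_VERSION=3.2 still set on the cluster), a small sanity check in the notebook:

# Minimal PyDeequ sanity check. Assumes the matching Deequ jar is already
# attached to the cluster as a Maven library and SPARK_VERSION is set.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.getOrCreate()  # on Databricks this is the existing session

# Tiny test DataFrame with one incomplete column
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, None)], ["id", "value"])

check = Check(spark, CheckLevel.Error, "basic checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("id").isUnique("id"))
          .run())

# Show the result of each constraint as a DataFrame
VerificationResult.checkResultsAsDataFrame(spark, result).show()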
