Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Error when importing PyDeequ package

hf_santos
New Contributor III

Hi everyone,

I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.

First, I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable

SPARK_VERSION=3.2

as described in the project's GitHub repository.
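To confirm that the variable is actually visible to the notebook's Python process, a check along the following lines can be run in a cell. This is a minimal sketch: `spark_minor_version` is a hypothetical helper, not part of PyDeequ, and the `setdefault` call is there only so the snippet runs outside a cluster.

```python
import os


def spark_minor_version(full_version: str) -> str:
    """Reduce a full Spark version such as '3.2.1' to 'major.minor'."""
    major, minor = full_version.split(".")[:2]
    return f"{major}.{minor}"


# On Databricks the variable is normally set in the cluster configuration.
# setdefault leaves an existing value untouched and only fills in a value
# when the variable is missing (an assumption made for this sketch).
os.environ.setdefault("SPARK_VERSION", spark_minor_version("3.2.1"))

print("SPARK_VERSION =", os.environ["SPARK_VERSION"])
```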

Since the package available on PyPI is not up to date, I tried installing it as a notebook-scoped library with the following commands:

%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git

(The first line is only to prevent a conflict on the numpy versions.)

Then, when doing

import pydeequ

I get

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
 
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
     19 from pydeequ.analyzers import AnalysisRunner
     20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
     22 from pydeequ.profiles import ColumnProfilerRunner
     23 
 
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
     35 
     36 
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
     38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
     26 
     27 def _get_deequ_maven_config():
---> 28     spark_version = _get_spark_version()
     29     try:
     30         return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
     21     ]
     22     output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23     spark_version = output.stdout.decode().split("\n")[-2]
     24     return spark_version
     25 
 
IndexError: list index out of range
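The last frame of the traceback indexes `[-2]` into the subprocess output split on newlines. That expression raises `IndexError` whenever the captured stdout contains fewer than two lines, for example when the subprocess printed nothing to stdout. The sketch below reproduces only that indexing step; `second_to_last_line` is a hypothetical name, and the command PyDeequ actually runs is not shown in the traceback.

```python
def second_to_last_line(stdout_text: str) -> str:
    # Same indexing as the failing line in pydeequ/configs.py.
    return stdout_text.split("\n")[-2]


# Normal case: the output ends with a newline, so the split yields
# ["3.2.1", ""] and index -2 is the version string.
print(second_to_last_line("3.2.1\n"))

# Failure case: empty stdout splits into [""], a single-element list,
# so index -2 does not exist.
try:
    second_to_last_line("")
except IndexError as exc:
    print("IndexError:", exc)
```

So the error message itself says little about the root cause; it only indicates that the version lookup produced no usable output.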

Can you please help me find the reason for this error, or suggest an alternative way to get the library without using PyPI?

Thanks in advance!

1 ACCEPTED SOLUTION


hf_santos
New Contributor III

I assumed I wouldn't need to add the Deequ library. Apparently, all I had to do was add it via Maven coordinates and it solved the problem.
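On Databricks, a Maven library is typically attached to the cluster through its Libraries tab. The traceback above shows PyDeequ looking up a coordinate by the first three characters of the Spark version, so a similar lookup can help pick the coordinate to install. The sketch below is illustrative only: the group and artifact `com.amazon.deequ:deequ` appear in the traceback, but the version string in the table is an assumption that should be checked against Maven Central, and `deequ_coord_for` is a hypothetical helper.

```python
# Assumed coordinate for Spark 3.2; verify on Maven Central before use.
ASSUMED_DEEQU_COORDS = {
    "3.2": "com.amazon.deequ:deequ:2.0.1-spark-3.2",
}


def deequ_coord_for(spark_version: str) -> str:
    """Return the assumed Deequ Maven coordinate for a Spark version."""
    key = spark_version[:3]  # same slicing as spark_version[:3] in the traceback
    try:
        return ASSUMED_DEEQU_COORDS[key]
    except KeyError:
        raise ValueError(
            f"No Deequ coordinate known for Spark {spark_version}"
        ) from None


print(deequ_coord_for("3.2.1"))
```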


4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

Yes, this is a real issue; I am facing the same problem. I am working on it and will update you soon.

AviralBhardwaj

Aviral-Bhardwaj
Esteemed Contributor III

Hey @Humberto Santos, I found an answer.

This happens because your NumPy version is not compatible with PyDeequ.

See, it is working:

[Screenshot not preserved]

numpy==1.20.1 is compatible with this package.

Please like or upvote this answer; you can also select it as the best answer.

Thanks

Aviral Bhardwaj

AviralBhardwaj

That was not the problem. I hadn't installed the Deequ library from Maven.

