Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Job runs indefinitely after integrating with PyDeequ

JD410993
New Contributor II

I'm using PyDeequ data quality checks in one of our jobs.

After adding these checks, the job no longer completes: it keeps running indefinitely even after the PyDeequ checks finish and the results are returned.

As recommended in the PyDeequ documentation, I've added the calls below at the end, after all processing is done.

spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()

However, the job continues to run and eventually has to be cancelled.

Has anyone else faced this while integrating PyDeequ on Databricks?

Would appreciate any pointers.

Thanks.

3 REPLIES 3

Hubert-Dudek
Esteemed Contributor III

I don't think Databricks supports it.

-werners-
Esteemed Contributor III

Hm, Deequ certainly works; I have read about multiple people using it.

And in the open and closed issues on the PyDeequ GitHub repository, Databricks is mentioned several times, so it might be possible after all.

But I think you need to check your Spark version, as there is an open issue regarding recent Spark versions (https://github.com/awslabs/python-deequ/issues/93).

-werners-
Esteemed Contributor III

To add to this:

Do not create your own SparkSession or stop it. Databricks manages the SparkSession for you.
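A minimal sketch of what this advice implies for the original teardown code: shut down only PyDeequ's Py4J callback server and skip `spark.stop()` when running on Databricks, since the platform owns the session. The helper name `teardown_pydeequ` and the `on_databricks` flag are my own illustration, not part of the PyDeequ API.

```python
def teardown_pydeequ(spark, on_databricks=True):
    """Shut down PyDeequ's Py4J callback server so the JVM thread
    doesn't keep the job alive. Only stop Spark outside Databricks,
    where you manage the SparkSession yourself."""
    gateway = getattr(spark.sparkContext, "_gateway", None)
    if gateway is not None:
        # This is the call PyDeequ's docs recommend at job end.
        gateway.shutdown_callback_server()
    if not on_databricks:
        # On Databricks, stopping the managed session can hang the job.
        spark.stop()
```

Called at the very end of the job (`teardown_pydeequ(spark)`), this keeps the callback-server shutdown from the documentation while leaving the Databricks-managed session untouched.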
