Data Engineering

Job runs indefinitely after integrating with PyDeequ

JD410993
New Contributor II

I'm using PyDeequ data quality checks in one of our jobs.

After adding these checks, I noticed that the job never completes: it keeps running indefinitely even after the PyDeequ checks have finished and the results have been returned.
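For context, the checks are wired up roughly like this; the DataFrame and the constraints are simplified placeholders, not the actual job code:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# df is a DataFrame produced earlier in the job (placeholder here)
check = Check(spark, CheckLevel.Error, "Data quality checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("id").isUnique("id"))
          .run())

# Collect the check results as a DataFrame for reporting
VerificationResult.checkResultsAsDataFrame(spark, result).show()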

As suggested in the PyDeequ documentation, I've added the calls below at the end, after all processing is done.

# Shut down the Py4J callback server that PyDeequ starts
spark.sparkContext._gateway.shutdown_callback_server()
# Stop the Spark session
spark.stop()

However, the job continues to run and has to be eventually cancelled.

Has anyone else faced this while integrating with PyDeequ on Databricks?

Would appreciate any pointers.

Thanks.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I don't think Databricks supports it.

-werners-
Esteemed Contributor III

Hm, Deequ certainly works; I have read about multiple people using it.

And reading through the open and closed issues on the pydeequ GitHub repository, Databricks is mentioned in several of them, so it might be possible after all.

But I think you need to check your Spark version and environment setup, as there is an open issue regarding recent Spark versions (https://github.com/awslabs/python-deequ/issues/93).
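For example, if I remember correctly, newer pydeequ releases read a SPARK_VERSION environment variable to pick the matching Deequ jar. Something like this before importing pydeequ, as a sketch (the version string is just an example, match it to your cluster):

import os

# Assumption: a recent pydeequ release that reads SPARK_VERSION.
# Set it before importing pydeequ, matching the cluster's Spark version.
os.environ["SPARK_VERSION"] = "3.1"

import pydeequ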

-werners-
Esteemed Contributor III

To add to this:

Do not create your own SparkSession or stop it. Databricks manages the SparkSession itself (see the sketch below).
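So on Databricks, the end of the job would look more like this, as a sketch: clean up pydeequ's Py4J callback server, but leave the session to the platform:

# Shut down the callback server that PyDeequ starts
spark.sparkContext._gateway.shutdown_callback_server()
# Do NOT call spark.stop() on Databricks; the platform manages the session lifecycle.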

Also check the announcements on the pydeequ GitHub page.
