
Job runs indefinitely after integrating with PyDeequ

JD410993
New Contributor II

I'm using PyDeequ data quality checks in one of our jobs.
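
For context, the check looks roughly like this (a minimal sketch; the table and column names are placeholders, not our actual job):

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

df = spark.read.table("some_table")  # placeholder input

check = Check(spark, CheckLevel.Error, "Basic quality check")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("id").isUnique("id"))
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show()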

After adding this check, the job no longer completes: it keeps running indefinitely even after the PyDeequ checks have finished and the results have been returned.

As suggested in the PyDeequ documentation, I've added the calls below at the end, after all processing is done.

spark.sparkContext._gateway.shutdown_callback_server()  # stop the Py4J callback server started by PyDeequ
spark.stop()  # stop the SparkSession

However, the job continues to run and eventually has to be cancelled manually.

Has anyone else faced this while integrating with PyDeequ on Databricks?

Would appreciate any pointers.

Thanks.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I don't think Databricks supports it.

-werners-
Esteemed Contributor III

Hm, Deequ certainly works; I have read about multiple people using it.

And looking through the open and closed issues on the PyDeequ GitHub page, Databricks is mentioned in several of them, so it should be possible after all.

But I think you need to check your Spark version, as there is an open issue regarding recent Spark versions (https://github.com/awslabs/python-deequ/issues/93).
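
As a quick sanity check you can print the versions your cluster is actually running, for example (assuming PyDeequ was installed via pip):

from importlib.metadata import version

print(spark.version)       # Spark version of the Databricks runtime
print(version("pydeequ"))  # installed PyDeequ release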

-werners-
Esteemed Contributor III

To add to this:

Do not create your own SparkSession or stop it. Databricks manages the SparkSession for you.
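
So on Databricks I would only shut down the Py4J callback server that PyDeequ starts and leave the session alone, something like this (an untested sketch):

# shut down only the callback server PyDeequ started;
# leave the Databricks-managed SparkSession running (no spark.stop())
spark.sparkContext._gateway.shutdown_callback_server()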

Also check the announcements on the PyDeequ page.
