cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Suggestion Needed for a Orchestrator/Scheduler to schedule and execute Jobs in an automated way

BkP
Contributor

Hello Friends,

We have an application which extracts dat from various tables in Azure Databricks and we extract it to postgres tables (postgres installed on top of Azure VMs). After extraction we apply transformation on those datasets in postgres tables with the help of spark programs written on Jupiter notebook and load the data to Neo4j graph database (Neo4j installed on Another Azure VM). For now we are doing the extraction through SQL queries and for transformation on Postgres we are leveraging Python(Spark) Programs. As the there are lot of tables (More than 100) and there is dependency , It is not possible to run everything manually. Hence we are looking for a Orchestrator and Scheduler where we can create our job execution workflow and schedule them to run at a particular time frame. Can you please suggest one ? Appreciate in advance. I am attaching the Architecture of the application here in this post.

image

15 REPLIES 15

Anonymous
Not applicable

Apache Airflow seems to be the standard kind of tool for this.

Thanks for your reply @Joseph Kambourakis​ , will explore more on Apache Airflow and try it out

User16764241763
Honored Contributor

You should also be able to use Azure Data Factory for orchestration and Scheduling pipelines.

https://azure.microsoft.com/en-in/services/data-factory/

Thanks for your response @Arvind Ravish​ 

BilalAslamDbrx
Honored Contributor II
Honored Contributor II

@Badal Panda​ please consider Databricks Workflows. It's fully-managed, reliable and supports your scenario.

thanks for your response @Bilal Aslam​ 

Kaniz
Community Manager
Community Manager

Hi @Badal Panda​, I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Hi @Kaniz Fatma​ ,

We are trying with Azure Data factory first by migrating our jupyter code to Databricks Notebooks. However the pipeline failed with below error while writing to a particular table in postgres from databricks -

org.apache.spark.SparkException: Job 910 cancelled because Task 30248 in Stage 1422 exceeded the maximum allowed ratio of input to output records (1 to 24919, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold

BilalAslamDbrx
Honored Contributor II
Honored Contributor II

Do you have a giant cross join that you are unaware of? or some join condition that is producing many rows in the output...

Hi @Badal Panda​,

Just a friendly follow-up. Do you still looking for help?

This error is coming from high concurrency cluster:

org.apache.spark.SparkException: Job 910 cancelled because Task 30248 in Stage 1422 exceeded the maximum allowed ratio of input to output records (1 to 24919, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold

solution: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/query-watchdog

BkP
Contributor

Hello @Jose Gonzalez​ ,

Thanks for your response , the issue is resolved.

Anonymous
Not applicable

Hey there @Badal Panda​ 

Hope you are doing well.

We are glad to hear that you were able to resolve your issue. Would be happy to mark an answer as best so that other members can find the solution more quickly?

Thanks!

Hi @Vartika Nain​ ,

Sure I can share details regarding the Orchestrator/Scheduler , but recently there have been changes to our design architecture with source systems so let me explain briefly

  1. For our application ,we initially extracted data from ADLS using databricks notebooks (Spark SQL / Pyspark) and ingested data to postgres and from postgres to Neo4j load we used Jupyter as highlighted in my architecture diagram that I shared here. Based on the suggestion I got here from experts we tried with Apache Airflow but did not get success , with Databricks Jobs we could only schedule notebooks in databricks but not the jobs that were running on Jupyter , We also tried ADF but ADF has no connectors to connect with Jupyter. So we resolved the issue with a Hybrid approach of Databricks Workflows (for all databricks notebooks) + CRONTAB for Jupyter Notebooks. The Jupyter is hosted on a local Linux machine so we used cron jobs with the help of a shell script for all notebooks of Jupyter.
  2. Now our source system is changing and we need to extract from synapse. So we are going to use Azure Data factory as our orchestrator and scheduler for all workloads.

I hope I have answered your question. Please let me know if there is anything else I can clarify.

Anonymous
Not applicable

Hey @Badal Panda​ 

Thank you so much for getting back to us. It's really great of you to send in your answer.

We really appreciate your time.

Wish you a great Databricks journey ahead!

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.