06-10-2022 04:34 AM
Hello Friends,
We have an application that extracts data from various tables in Azure Databricks into Postgres tables (Postgres installed on Azure VMs). After extraction, we apply transformations to those datasets in the Postgres tables using Spark programs written in Jupyter notebooks, and then load the data into a Neo4j graph database (Neo4j installed on another Azure VM). For now we are doing the extraction through SQL queries, and for the transformations on Postgres we are leveraging Python (Spark) programs. As there are a lot of tables (more than 100) and there are dependencies between them, it is not possible to run everything manually. Hence we are looking for an orchestrator and scheduler where we can create our job execution workflow and schedule it to run in a particular time frame. Can you please suggest one? Thanks in advance. I am attaching the architecture of the application to this post.
06-10-2022 04:37 PM
Apache Airflow seems to be the standard kind of tool for this.
06-12-2022 11:46 PM
Thanks for your reply @Joseph Kambourakis, I will explore Apache Airflow further and try it out.
06-11-2022 12:58 AM
You should also be able to use Azure Data Factory for orchestrating and scheduling pipelines.
06-12-2022 11:48 PM
Thanks for your response @Arvind Ravish
06-11-2022 03:04 AM
@Badal Panda please consider Databricks Workflows. It's fully managed, reliable, and supports your scenario.
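To make the Workflows suggestion concrete, here is a minimal sketch of a Databricks Jobs API 2.1 job definition with task dependencies and a cron schedule. The job name, notebook paths, cluster spec, and schedule are illustrative assumptions, not details from this thread.

```json
{
  "name": "postgres-to-neo4j-etl",
  "job_clusters": [
    {
      "job_cluster_key": "etl_cluster",
      "new_cluster": {
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    }
  ],
  "tasks": [
    {
      "task_key": "extract_to_postgres",
      "job_cluster_key": "etl_cluster",
      "notebook_task": { "notebook_path": "/ETL/extract_to_postgres" }
    },
    {
      "task_key": "transform_in_postgres",
      "depends_on": [ { "task_key": "extract_to_postgres" } ],
      "job_cluster_key": "etl_cluster",
      "notebook_task": { "notebook_path": "/ETL/transform_in_postgres" }
    },
    {
      "task_key": "load_to_neo4j",
      "depends_on": [ { "task_key": "transform_in_postgres" } ],
      "job_cluster_key": "etl_cluster",
      "notebook_task": { "notebook_path": "/ETL/load_to_neo4j" }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```

Each `depends_on` entry makes a task wait for its upstream task, which is how dependencies across 100+ tables can be expressed without running anything manually.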
06-12-2022 11:48 PM
Thanks for your response @Bilal Aslam
06-17-2022 05:02 AM
Hi @Kaniz Fatma,
We are trying Azure Data Factory first, by migrating our Jupyter code to Databricks notebooks. However, the pipeline failed with the error below while writing from Databricks to a particular table in Postgres:
org.apache.spark.SparkException: Job 910 cancelled because Task 30248 in Stage 1422 exceeded the maximum allowed ratio of input to output records (1 to 24919, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold
06-18-2022 12:42 AM
Do you have a giant cross join that you are unaware of? Or some join condition that is producing many rows in the output?
07-29-2022 11:44 AM
Hi @Badal Panda,
Just a friendly follow-up. Are you still looking for help?
This error comes from the Query Watchdog on a high-concurrency cluster:
org.apache.spark.SparkException: Job 910 cancelled because Task 30248 in Stage 1422 exceeded the maximum allowed ratio of input to output records (1 to 24919, max allowed 1 to 10000); this limit can be modified with configuration parameter spark.databricks.queryWatchdog.outputRatioThreshold
solution: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/query-watchdog
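For reference, the limit named in the error can be relaxed from the notebook itself. This is a sketch assuming a Databricks notebook where `spark` (the SparkSession) is predefined; the threshold value 50000 is an illustrative choice, not a recommendation from the linked docs.

```python
# In a Databricks notebook, `spark` is predefined.
# Query Watchdog cancels queries whose output rows exceed the input rows by
# more than outputRatioThreshold; raise it for a known row-expanding ETL query.
spark.conf.set("spark.databricks.queryWatchdog.enabled", True)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 50000)
```

Raising the threshold only makes sense once an accidental cross join has been ruled out, as suggested above.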
08-01-2022 06:34 AM
Hello @Jose Gonzalez,
Thanks for your response, the issue is resolved.
08-18-2022 07:57 AM
Hey there @Badal Panda
Hope you are doing well.
We are glad to hear that you were able to resolve your issue. Would you be happy to mark an answer as best so that other members can find the solution more quickly?
Thanks!
08-18-2022 09:53 PM
Hi @Vartika Nain,
Sure, I can share details regarding the orchestrator/scheduler, but recently there have been changes to our design architecture with the source systems, so let me explain briefly.
I hope I have answered your question. Please let me know if there is anything else I can clarify.
08-22-2022 10:01 AM
Hey @Badal Panda
Thank you so much for getting back to us. It's really great of you to send in your answer.
We really appreciate your time.
Wish you a great Databricks journey ahead!
12-05-2022 09:29 PM
You can leverage Airflow, which provides an operator for the Databricks Jobs API, or you can use Databricks Workflows to orchestrate your jobs, defining several tasks and setting dependencies between them accordingly.
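As an illustration of the Airflow route, the sketch below defines a DAG that chains three Databricks notebook runs using the `apache-airflow-providers-databricks` operator. The DAG id, schedule, connection id, notebook paths, and cluster spec are assumptions for illustration, not details from this thread.

```python
# Sketch of an Airflow DAG orchestrating the extract -> transform -> load
# pipeline as Databricks notebook runs.
# Requires: apache-airflow, apache-airflow-providers-databricks.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

CLUSTER = {  # hypothetical job-cluster spec
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}


def notebook_run(task_id: str, path: str) -> DatabricksSubmitRunOperator:
    """Wrap one Databricks notebook as an Airflow task."""
    return DatabricksSubmitRunOperator(
        task_id=task_id,
        databricks_conn_id="databricks_default",
        new_cluster=CLUSTER,
        notebook_task={"notebook_path": path},
    )


with DAG(
    dag_id="postgres_to_neo4j_etl",   # hypothetical DAG name
    schedule_interval="0 2 * * *",    # run daily at 02:00
    start_date=datetime(2022, 6, 1),
    catchup=False,
) as dag:
    extract = notebook_run("extract_to_postgres", "/ETL/extract_to_postgres")
    transform = notebook_run("transform_in_postgres", "/ETL/transform_in_postgres")
    load = notebook_run("load_to_neo4j", "/ETL/load_to_neo4j")

    # Dependencies: transform waits for extract, load waits for transform.
    extract >> transform >> load
```

The `>>` operator encodes the same task dependencies that Databricks Workflows expresses with `depends_on`, so either tool can replace the manual runs described in the original question.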