Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks with Airflow

sandelic
New Contributor II

Hi there, 

I'm trying to understand the advantages of using Airflow operators to orchestrate Databricks notebooks, given that Databricks already offers its own workflow solution. Could someone please explain the benefits?

Thanks,

Stefan

5 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @sandelic ,

If your workload is mainly Databricks-centered, then stick to Workflows. They are easy to manage, and Workflows integrate directly with Databricks notebooks and jobs.
But sometimes your workload requires complex orchestration and scheduling across many different systems, and Airflow was made exactly for this. Airflow allows for extensive customization: you can author and schedule workflows programmatically in Python (you can do something similar with DAB, but Airflow has more options), and it supports a wide range of integrations with different systems, including cloud platforms, databases, and more.
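As a rough illustration (not the only way to do it), triggering an existing Databricks job from Airflow can be as small as the sketch below. The job ID and connection ID are placeholders, and the exact parameters depend on your Airflow and apache-airflow-providers-databricks versions (older Airflow releases use schedule_interval instead of schedule):

```python
# Minimal sketch: trigger an existing Databricks job from Airflow.
# Assumes the apache-airflow-providers-databricks package is installed and an
# Airflow connection named "databricks_default" points at your workspace.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="trigger_databricks_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_job = DatabricksRunNowOperator(
        task_id="run_databricks_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # placeholder: ID of an existing Databricks job
    )
```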

I would say that if you're running primarily Spark-based workflows, Databricks Workflows are a great choice. However, if your data pipelines involve several different systems working together, Airflow is probably a better fit for your needs. It has a steeper learning curve, though.

sandelic
New Contributor II

Thanks for clarifying @szymon_dybczak. Could you elaborate on 'several different systems working together'? Specifically, does this imply that Airflow is recommended when other tools, like dbt-core, are already in use (for instance, if Databricks Workflows integrate with dbt Cloud)?

szymon_dybczak
Esteemed Contributor III

Sure, in many real-world data pipelines you don't just process data in one tool like Databricks; instead, you're interacting with a variety of systems at different stages of the pipeline. So, let's say that your workload requires orchestrating the following things:

1. S3 File Upload → (AWS S3 Sensor)
2. Load File into Snowflake → (SnowflakeOperator)
3. Run Data Quality Checks → (Custom PythonOperator)
4. Trigger Databricks Notebook → (DatabricksSubmitRunOperator)
5. Push Result to REST API → (HttpOperator)
6. Run a Spark Job on EMR
7. Send Slack Notification → (SlackWebhookOperator)
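To make the shape of such a pipeline concrete, here is a rough, simplified sketch of how a few of these steps could be wired together in Airflow. The EMR and REST API steps are omitted for brevity, every connection ID, bucket name, SQL statement, cluster ID, and notebook path is a placeholder, and exact operator import paths vary between provider package versions:

```python
# Rough sketch of a multi-system Airflow DAG (placeholders throughout).
# Assumes the amazon, snowflake, databricks, and slack provider packages are installed
# and the referenced Airflow connections are configured.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def run_quality_checks():
    # Placeholder for custom data quality logic (row counts, null checks, etc.).
    pass


with DAG(
    dag_id="multi_system_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_s3_file",
        bucket_name="my-landing-bucket",          # placeholder bucket
        bucket_key="incoming/{{ ds }}/data.csv",  # placeholder key
        aws_conn_id="aws_default",
    )

    load_to_snowflake = SnowflakeOperator(
        task_id="load_file_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO raw.my_table FROM @my_stage;",  # placeholder SQL
    )

    quality_checks = PythonOperator(
        task_id="run_data_quality_checks",
        python_callable=run_quality_checks,
    )

    run_notebook = DatabricksSubmitRunOperator(
        task_id="trigger_databricks_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="1234-567890-abcdefgh",               # placeholder cluster ID
        notebook_task={"notebook_path": "/Shared/transformations"},  # placeholder path
    )

    notify_slack = SlackWebhookOperator(
        task_id="send_slack_notification",
        slack_webhook_conn_id="slack_default",
        message="Pipeline finished for {{ ds }}",
    )

    wait_for_file >> load_to_snowflake >> quality_checks >> run_notebook >> notify_slack
```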

As you can see, in a scenario like the one above it could be better to use Airflow because it has a rich ecosystem of pre-built operators (Slack, AWS, GCP, Azure, Kubernetes, etc.).
Also, you can write your own operators for custom needs (maybe you need to send some kind of custom notification after a workflow succeeds or fails; you can do this by writing your own custom operator, or by checking whether one that fits your need already exists); a minimal sketch follows below.
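As a rough illustration of that last point, a custom operator is just a subclass of BaseOperator with an execute method; the notification logic below is a hypothetical placeholder:

```python
# Minimal sketch of a custom Airflow operator (placeholder logic).
from airflow.models.baseoperator import BaseOperator


class CustomNotificationOperator(BaseOperator):
    """Send a custom notification with a templated message (hypothetical example)."""

    template_fields = ("message",)  # let Airflow render Jinja in `message`

    def __init__(self, message: str, **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # Placeholder: call your internal notification system or API here.
        self.log.info("Sending custom notification: %s", self.message)
```

Once it is importable from your DAG files, you can use it in a DAG like any built-in operator.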

Regarding your second question, not necessarily. Databricks Workflows integrates with dbt Core really well (and so does Airflow), and the product team keeps adding tons of new features with each release.
So, if you don't need a really complex orchestration scenario, stick to Workflows. They're simpler, and you don't need to set up a whole infrastructure to run them (as you do in the case of Airflow).
Otherwise, if you need to handle custom things or orchestrate systems that you can't reach from Workflows, then choose Airflow. But as I said, Airflow definitely has a steeper learning curve.

Use dbt transformations in Lakeflow Jobs | Databricks Documentation

sandelic
New Contributor II

Thanks @szymon_dybczak for the thorough explanation.

szymon_dybczak
Esteemed Contributor III

Hi @sandelic ,

No problem. If the answer was helpful, please consider marking it as a solution; this way we help other community members find solutions to similar questions faster.
