07-27-2025 03:16 PM - edited 07-27-2025 03:19 PM
Sure. In many real-world data pipelines, you don't process data in just one tool like Databricks; you interact with a variety of systems at different stages of the pipeline. So, let's say your workload requires orchestrating the following steps:
1. S3 File Upload → (AWS S3 Sensor)
2. Load File into Snowflake → (SnowflakeOperator)
3. Run Data Quality Checks → (Custom PythonOperator)
4. Trigger Databricks Notebook → (DatabricksSubmitRunOperator)
5. Push Result to REST API → (HttpOperator)
6. Run a Spark job on EMR
7. Send Slack Notification → (SlackWebhookOperator)
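To make the shape of that workload concrete, here is a rough, tool-agnostic sketch in plain Python (standard library only, not Airflow code). It assumes the seven steps run strictly one after another, which is my assumption and not something a real pipeline has to follow. The step names and the `build_dependencies` / `execution_order` helpers are hypothetical; the operator labels are just the ones listed above, with `None` for the EMR step since no operator is named for it.

```python
from graphlib import TopologicalSorter

# Hypothetical step names mapped to the operator named in the list above.
# None means the list does not name an operator for that step.
STEPS = {
    "wait_for_s3_file": "AWS S3 Sensor",
    "load_into_snowflake": "SnowflakeOperator",
    "run_quality_checks": "Custom PythonOperator",
    "trigger_databricks_notebook": "DatabricksSubmitRunOperator",
    "push_result_to_rest_api": "HttpOperator",
    "run_spark_job_on_emr": None,
    "send_slack_notification": "SlackWebhookOperator",
}


def build_dependencies(step_names):
    """Map each step to the set of steps it waits for.

    Assumption: a simple linear chain, each step depends on the previous one.
    """
    names = list(step_names)
    deps = {names[0]: set()}
    for upstream, downstream in zip(names, names[1:]):
        deps[downstream] = {upstream}
    return deps


def execution_order(step_names):
    """Return one valid run order for the dependency graph."""
    sorter = TopologicalSorter(build_dependencies(step_names))
    return list(sorter.static_order())


if __name__ == "__main__":
    for position, name in enumerate(execution_order(STEPS), start=1):
        operator = STEPS[name] or "no operator named"
        print(f"{position}. {name} ({operator})")
```

An orchestrator does essentially this for you (plus scheduling, retries, and connections to each system). In Airflow, each entry would become a task built from the corresponding operator.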
As you can see, in the scenario above it could be better to use Airflow, because it has a rich ecosystem of pre-built operators (Slack, AWS, GCP, Azure, Kubernetes, etc.).
Also, you can write your own operators for custom needs. For example, if you need to send some kind of custom notification after a workflow succeeds or fails, you can write your own custom operator, or check whether one that fits your need already exists.
Regarding your second question: not necessarily. Databricks Workflows integrates with dbt Core really well (so does Airflow), and the product team keeps adding new features with each release.
So, if you don't need a really complex orchestration scenario, stick to Workflows. They're simpler, and you don't need to set up a whole infrastructure to run them (as you do with Airflow).
Otherwise, if you need to handle custom things or orchestrate systems that Workflows can't reach, choose Airflow. But as I said, Airflow definitely has a steeper learning curve.
Use dbt transformations in Lakeflow Jobs | Databricks Documentation