topic Re: ETL pipeline in Data Engineering

ETL pipeline

Yunky007 — Fri, 18 Apr 2025 10:43:07 GMT

I have an ETL pipeline in Workflows that I use to create a materialized view. I want to schedule the pipeline to run for only 10 hours, starting at 10 AM. How can I schedule that? I can only see hourly schedules or cron syntax. I want the compute to be up for 10 hours and then terminate.

Thanks

Yogesh

Re: ETL pipeline

tltharani — Fri, 18 Apr 2025 11:03:09 GMT

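Databricks doesn't support duration-based schedules directly, but you can simulate one using cron syntax.
Use this cron expression (Quartz syntax, which Databricks job schedules use and which starts with a seconds field): 0 0 10-19 * * ?
This triggers a run at the start of every hour from 10:00 through 19:00.
To ensure compute is not running outside these hours, set auto-termination to a low value, such as 15 minutes.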
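As a rough illustration of the hourly schedule described above, here is how it might look as the `schedule` block of a job definition sent to the Databricks Jobs API. This is a sketch, not a tested configuration: the `UTC` timezone is a placeholder assumption, and the expression is written in Quartz form (seconds, minutes, hours, day of month, month, day of week).

```python
# Sketch of a Jobs API "schedule" block for hourly runs between 10:00 and 19:00.
# Assumption: "UTC" is a placeholder; use the timezone the 10 AM start refers to.
hourly_schedule = {
    "quartz_cron_expression": "0 0 10-19 * * ?",  # sec min hour day-of-month month day-of-week
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}

# Count the start times the hour range 10-19 covers.
first_hour, last_hour = (int(h) for h in hourly_schedule["quartz_cron_expression"].split()[2].split("-"))
start_hours = list(range(first_hour, last_hour + 1))
print(len(start_hours), start_hours[0], start_hours[-1])
```

The range is inclusive, so it yields ten runs per day, the last one starting at 19:00.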

Re: ETL pipeline

Isi — Fri, 18 Apr 2025 12:37:14 GMT

Hey @Yunky007

You should use the cron expression 0 0 10 * * ? (Quartz syntax, which Databricks job schedules use) to start the process at 10 AM.
Then, inside your script, implement a loop or a similar mechanism that keeps the logic running for 10 hours. That's the trick.

    import time
    from datetime import datetime, timedelta

    # Keep the job running for 10 hours from the moment it starts
    start_time = datetime.now()
    end_time = start_time + timedelta(hours=10)

    while datetime.now() < end_time:
        # Logic
        spark.sql("REFRESH MATERIALIZED VIEW my_catalog.my_schema.my_mv")

        # Wait time between executions
        time.sleep(60 * 60)  # 3600 secs = 1 h
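One possible refinement of the loop above, as a sketch: cap each sleep at the time remaining so the job does not keep the compute up past the 10-hour mark. The helper name `run_for_window` and its parameters are hypothetical, and the clock and sleep functions are passed in only so the loop can be exercised without waiting.

```python
import time
from datetime import datetime, timedelta


def run_for_window(task, window=timedelta(hours=10), interval=timedelta(hours=1),
                   now=datetime.now, sleep=time.sleep):
    """Call task() every `interval` until `window` has elapsed; return the run count."""
    deadline = now() + window
    runs = 0
    while now() < deadline:
        task()
        runs += 1
        remaining = (deadline - now()).total_seconds()
        if remaining <= 0:
            break
        # Never sleep past the deadline.
        sleep(min(interval.total_seconds(), remaining))
    return runs


# In a Databricks notebook, where `spark` is predefined, the call might be:
# run_for_window(lambda: spark.sql("REFRESH MATERIALIZED VIEW my_catalog.my_schema.my_mv"))
```

With the defaults and a fast task, this performs ten refreshes and returns roughly 10 hours after it starts.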

Hope this helps 🙂

Isi

Re: ETL pipeline

KaelaniBraster — Mon, 05 May 2025 11:12:10 GMT

Use cron syntax to start the job, and add a stop condition that ends it after 10 hours of runtime.
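One way to read "stop condition" is a job-level timeout. The sketch below shows job settings that combine a daily 10 AM start with `timeout_seconds`, as they might be passed to the Databricks Jobs API. The job name and the `UTC` timezone are placeholder assumptions, and a run cut off by a timeout is typically reported as timed out rather than succeeded, so an in-script deadline may be the cleaner option.

```python
# Sketch of job settings: start daily at 10:00 and cap each run at 10 hours.
# Assumptions: the name and timezone are placeholders, and tasks are omitted.
TEN_HOURS_IN_SECONDS = 10 * 60 * 60

job_settings = {
    "name": "refresh-materialized-view",  # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 0 10 * * ?",  # every day at 10:00:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "timeout_seconds": TEN_HOURS_IN_SECONDS,  # the run is cancelled after this long
}

print(job_settings["timeout_seconds"])
```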