<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic {{start_time}} isn't accurate and doesn't behave logically for multi-task jobs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/start-time-isn-t-accurate-and-doesn-t-behave-logically-for-multi/m-p/22638#M15537</link>
    <description>&lt;P&gt;I am trying to run an incremental data processing job using a Python wheel.&lt;/P&gt;&lt;P&gt;The job is scheduled to run, e.g., every hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For my code to know which data increment to process, I inject {{start_time}} as part of the command line, like so:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;["end_date={{start_time}}"]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I have noticed two things:&lt;/P&gt;&lt;P&gt;* The start_time seems to refer to when the scheduler actually woke up, vs. when it was &lt;I&gt;meant to wake up&lt;/I&gt;. E.g., instead of passing the exact on-the-hour time, it can contain 2-3 seconds past the hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;* When I run a job with two tasks which run one after the other, they each get a different {{start_time}} value. Since scheduling is done at the job level rather than the task level, and you have a feature for injecting the time into the job, I can't see the point of passing different values to each task.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Either of these behaviors makes {{start_time}} not reliable enough for processing time windows of data.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Other standard schedulers like Airflow and Prefect do pass the correct "planned job trigger time" to the jobs, and are reliable enough for processing time windows.&lt;/P&gt;&lt;P&gt;See &lt;A href="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables" alt="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables" target="_blank"&gt;here&lt;/A&gt; and &lt;A href="https://docs-v1.prefect.io/api/latest/utilities/context.html#context" alt="https://docs-v1.prefect.io/api/latest/utilities/context.html#context" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you share the best practice for reliably injecting the &lt;I&gt;planned trigger time&lt;/I&gt; into the job &lt;I&gt;and all of its tasks&lt;/I&gt;?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 12 Nov 2022 10:54:32 GMT</pubDate>
    <dc:creator>assapin</dc:creator>
    <dc:date>2022-11-12T10:54:32Z</dc:date>
    <item>
      <title>{{start_time}} isn't accurate and doesn't behave logically for multi-task jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/start-time-isn-t-accurate-and-doesn-t-behave-logically-for-multi/m-p/22638#M15537</link>
      <description>&lt;P&gt;I am trying to run an incremental data processing job using a Python wheel.&lt;/P&gt;&lt;P&gt;The job is scheduled to run, e.g., every hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For my code to know which data increment to process, I inject {{start_time}} as part of the command line, like so:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;["end_date={{start_time}}"]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I have noticed two things:&lt;/P&gt;&lt;P&gt;* The start_time seems to refer to when the scheduler actually woke up, vs. when it was &lt;I&gt;meant to wake up&lt;/I&gt;. E.g., instead of passing the exact on-the-hour time, it can contain 2-3 seconds past the hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;* When I run a job with two tasks which run one after the other, they each get a different {{start_time}} value. Since scheduling is done at the job level rather than the task level, and you have a feature for injecting the time into the job, I can't see the point of passing different values to each task.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Either of these behaviors makes {{start_time}} not reliable enough for processing time windows of data.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Other standard schedulers like Airflow and Prefect do pass the correct "planned job trigger time" to the jobs, and are reliable enough for processing time windows.&lt;/P&gt;&lt;P&gt;See &lt;A href="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables" alt="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables" target="_blank"&gt;here&lt;/A&gt; and &lt;A href="https://docs-v1.prefect.io/api/latest/utilities/context.html#context" alt="https://docs-v1.prefect.io/api/latest/utilities/context.html#context" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you share the best practice for reliably injecting the &lt;I&gt;planned trigger time&lt;/I&gt; into the job &lt;I&gt;and all of its tasks&lt;/I&gt;?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 12 Nov 2022 10:54:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/start-time-isn-t-accurate-and-doesn-t-behave-logically-for-multi/m-p/22638#M15537</guid>
      <dc:creator>assapin</dc:creator>
      <dc:date>2022-11-12T10:54:32Z</dc:date>
    </item>
  </channel>
</rss>

