<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic differences between notebooks and notebooks that run inside a job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114529#M44860</link>
    <description>&lt;P class=""&gt;&lt;STRONG&gt;Hello Community,&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;I'm facing an issue with a job that runs a notebook task. When I run the same join condition through the job pipeline, it produces different results compared to running the notebook interactively (outside the job).&lt;/P&gt;&lt;P class=""&gt;Why might this be happening? Could there be differences in how timestamp columns are handled between jobs and interactive notebook runs?&lt;/P&gt;</description>
    <pubDate>Fri, 04 Apr 2025 14:35:03 GMT</pubDate>
    <dc:creator>jeremy98</dc:creator>
    <dc:date>2025-04-04T14:35:03Z</dc:date>
    <item>
      <title>differences between notebooks and notebooks that run inside a job</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114529#M44860</link>
      <description>&lt;P class=""&gt;&lt;STRONG&gt;Hello Community,&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;I'm facing an issue with a job that runs a notebook task. When I run the same join condition through the job pipeline, it produces different results compared to running the notebook interactively (outside the job).&lt;/P&gt;&lt;P class=""&gt;Why might this be happening? Could there be differences in how timestamp columns are handled between jobs and interactive notebook runs?&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 14:35:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114529#M44860</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-04-04T14:35:03Z</dc:date>
    </item>
    <item>
      <title>Re: differences between notebooks and notebooks that run inside a job</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114530#M44861</link>
      <description>&lt;P&gt;This is very interesting (and unexpected) - can you share more about what the job is doing, your configs, and what differences you're seeing in the results?&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 14:42:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114530#M44861</guid>
      <dc:creator>holly</dc:creator>
      <dc:date>2025-04-04T14:42:46Z</dc:date>
    </item>
    <item>
      <title>Re: differences between notebooks and notebooks that run inside a job</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114540#M44865</link>
<description>&lt;P class=""&gt;Hi,&lt;/P&gt;&lt;P class=""&gt;Thanks for your question!&lt;BR /&gt;What I'm doing is essentially loading a table from PostgreSQL over a Spark JDBC connection and also reading the corresponding table from Databricks. I then perform delete, update, and insert operations by comparing the two datasets.&lt;/P&gt;&lt;P class=""&gt;Using join conditions, I check for differences between the tables. If there are any discrepancies, a query is applied to sync the data in PostgreSQL accordingly, like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;    update_cols = [c.strip() for c in info_logic['update_cols'].split(',')]  # strip() makes this robust to ", " vs "," (matches the split used further down)
    change_conditions = [ ( (source_df[col] != postgres_df[col]) | 
        (source_df[col].isNull() &amp;amp; postgres_df[col].isNotNull()) | 
        (source_df[col].isNotNull() &amp;amp; postgres_df[col].isNull()))  for col in update_cols] # open question: do we also need to update when the NULL is on the Postgres side, i.e. when the row was only ever inserted on the Databricks side?
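    # Note (suggestion, not from the original post): Spark's null-safe equality
    # Column.eqNullSafe (available since Spark 2.3) expresses the same
    # three-way check in a single expression:
    # change_conditions = [~source_df[c].eqNullSafe(postgres_df[c]) for c in update_cols]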
    
    final_change_condition = change_conditions[0] # combine all per-column conditions with OR: a row is flagged if any column changed or has a NULL on only one side
    for cond in change_conditions[1:]:
        final_change_condition = final_change_condition | cond
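    # Equivalent one-liner (assumes "from functools import reduce" is imported):
    # final_change_condition = reduce(lambda a, b: a | b, change_conditions)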

    changed_records_df = source_df \
        .join(postgres_df, join_condition, "left_outer") \
        .filter(final_change_condition) \
        .select(source_df["*"]) # since this is my source, my target table in databricks takes these values
    
    num_rows = changed_records_df.count()

    print(f"UPDATE {num_rows} records")
    if num_rows &amp;gt; 0:
        update_cols = [col.strip() for col in info_logic['update_cols'].split(",")]
        primary_keys = [col.strip() for col in info_logic['primary_keys'].split(",")]

        update_data = [tuple(row[col] for col in update_cols + primary_keys) for row in changed_records_df.toLocalIterator()]
        update_query = syncer._generate_update_statement(table_name, info_logic['update_cols'], info_logic['primary_keys'])

        syncer._execute_dml(update_query, update_data, connection_properties, "UPDATE", batch_size=BATCH_SIZE)
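    # _generate_update_statement is the poster's own helper (not shown in the
    # thread); a plausible assumed output is a parameterized statement such as
    # "UPDATE {table} SET col1 = %s, ... WHERE pk1 = %s AND pk2 = %s", with
    # update_data supplying values in the order update_cols + primary_keys.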

    # ---------------- DELETE ----------------

    records_to_delete_df = postgres_df.join(
       source_df,
       join_condition,
       "left_anti"
    ).select(*[postgres_df[col.strip()] for col in info_logic['primary_keys'].split(",")])

    num_rows = records_to_delete_df.count()
    print(f"DELETE {num_rows} records")
    if num_rows &amp;gt; 0:
        delete_data = [tuple(row) for row in records_to_delete_df.toLocalIterator()]
        delete_query = syncer._generate_delete_statement(table_name, info_logic['primary_keys'])
        syncer._execute_dml(delete_query, delete_data, connection_properties, "DELETE", batch_size=BATCH_SIZE)
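    # Note: toLocalIterator() streams results to the driver one partition at a
    # time (gentler on driver memory than collect()), but every row still flows
    # through the driver before _execute_dml batches it over JDBC (BATCH_SIZE).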

    # ---------------- INSERT ----------------

    new_records_df = source_df.join(
        postgres_df,
        join_condition,
        "left_anti"
    ).select(source_df["*"])

    num_rows = new_records_df.count()
    print(f"INSERT {num_rows} records")
    if num_rows &amp;gt; 0:
        all_columns = [col.strip() for col in info_logic['primary_keys'].split(",")]
        if info_logic['update_cols']:
            all_columns.extend([col.strip() for col in info_logic['update_cols'].split(",")])

        insert_data = [tuple(row[col] for col in all_columns) for row in new_records_df.toLocalIterator()]
        insert_query = syncer._generate_insert_statement(table_name, all_columns)
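    # One last hedged aside (assumption, consistent with the cluster-restart fix
    # reported later in this thread): on a cluster that has been up for days,
    # cached DataFrames or temp views left over from earlier runs can make the
    # same joins return stale rows; clearing the cache before the initial reads
    # is a cheap way to rule that out:
    # spark.catalog.clearCache()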

        syncer._execute_dml(insert_query, insert_data, connection_properties, "INSERT", batch_size=BATCH_SIZE)&lt;/LI-CODE&gt;&lt;P&gt;But this code works correctly when the notebook is run interactively; it does not when it runs inside the job...&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 16:27:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114540#M44865</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-04-04T16:27:48Z</dc:date>
    </item>
    <item>
      <title>Re: differences between notebooks and notebooks that run inside a job</title>
      <link>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114542#M44867</link>
<description>&lt;P&gt;Sorry, but the problem was solved by restarting the cluster, which had been up for 4 days... what does that mean?&lt;/P&gt;</description>
      <pubDate>Fri, 04 Apr 2025 16:52:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/differences-between-notebooks-and-notebooks-that-run-inside-a/m-p/114542#M44867</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-04-04T16:52:21Z</dc:date>
    </item>
  </channel>
</rss>

