<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Upsert from Databricks to CosmosDB in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/upsert-from-databricks-to-cosmosdb/m-p/114440#M44826</link>
    <description>&lt;P&gt;Yes, you can update only the modified fields in your Cosmos DB documents from Databricks using the Partial Document Update feature (also known as Patch API). This is particularly useful for large documents where sending the entire document for updates is inefficient.&lt;/P&gt;
&lt;P&gt;Using Partial Document Update from Databricks&lt;/P&gt;
&lt;P&gt;The standard Databricks to Cosmos DB connector with `mode("overwrite")` and `"Upsert": "true"` will replace the entire document, which is causing your issue. Instead, you can:&lt;/P&gt;
&lt;P&gt;1. Use the Cosmos DB SDK directly within Databricks to perform partial updates:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;from azure.cosmos import CosmosClient, PartitionKey&lt;/P&gt;
&lt;P&gt;Connect to your Cosmos DB account&lt;BR /&gt;client = CosmosClient(endpoint="&lt;A href="https://your-account.documents.azure.com:443/" target="_blank"&gt;https://your-account.documents.azure.com:443/&lt;/A&gt;", credential="your-key")&lt;BR /&gt;database = client.get_database_client("mydatabase")&lt;BR /&gt;container = database.get_container_client("mycontainer")&lt;/P&gt;
&lt;P&gt;Define patch operations&lt;BR /&gt;operations = [&lt;BR /&gt;{"op": "replace", "path": "/field1", "value": "new_value"},&lt;BR /&gt;{"op": "add", "path": "/newField", "value": 123}&lt;BR /&gt;]&lt;/P&gt;
&lt;P&gt;Apply patch operations&lt;BR /&gt;container.patch_item(item="document_id", partition_key="partition_key_value", patch_operations=operations)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;2. For batch processing, you can use Databricks to identify changed fields and then construct patch operations:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;# Process in batches&lt;BR /&gt;def patch_document(row):&lt;BR /&gt;doc_id = row["id"]&lt;BR /&gt;partition_key = row["partitionKey"]&lt;BR /&gt;&lt;BR /&gt;# Determine which fields changed&lt;BR /&gt;operations = []&lt;BR /&gt;for field in changed_fields:&lt;BR /&gt;operations.append({"op": "replace", "path": f"/{field}", "value": row[field]})&lt;BR /&gt;&lt;BR /&gt;# Apply patch&lt;BR /&gt;container.patch_item(item=doc_id, partition_key=partition_key, patch_operations=operations)&lt;/P&gt;
&lt;P&gt;# Apply to your dataframe&lt;BR /&gt;df.foreach(patch_document)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;## Benefits of Partial Updates&lt;/P&gt;
&lt;P&gt;1. Reduced network traffic - only sending changed fields&lt;BR /&gt;2. Lower RU consumption - partial updates typically cost less than full replacements&lt;BR /&gt;3. Avoids concurrency issues - no need to read-modify-write the entire document&lt;BR /&gt;4. Preserves existing fields - other properties remain untouched&lt;/P&gt;
&lt;P&gt;This approach is much more efficient than the current connector-based approach where the entire document is replaced, causing fields to be deleted when they're not included in the new document.&lt;/P&gt;
&lt;P&gt;For very large datasets, you might need to implement batching and error handling to manage the process efficiently.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 03 Apr 2025 17:21:05 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-04-03T17:21:05Z</dc:date>
    <item>
      <title>Upsert from Databricks to CosmosDB</title>
      <link>https://community.databricks.com/t5/data-engineering/upsert-from-databricks-to-cosmosdb/m-p/114295#M44779</link>
      <description>&lt;P&gt;Hi guys,&lt;/P&gt;&lt;P&gt;I'm adjusting a data upsert process from Databricks to CosmosDB using the .jar connector. As the load is very large, do you know if it's possible to change only the fields that have been modified?&lt;/P&gt;&lt;P&gt;best regards&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 13:43:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/upsert-from-databricks-to-cosmosdb/m-p/114295#M44779</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2025-04-02T13:43:49Z</dc:date>
    </item>
    <item>
      <title>Re: Upsert from Databricks to CosmosDB</title>
      <link>https://community.databricks.com/t5/data-engineering/upsert-from-databricks-to-cosmosdb/m-p/114440#M44826</link>
      <description>&lt;P&gt;Yes, you can update only the modified fields in your Cosmos DB documents from Databricks using the Partial Document Update feature (also known as Patch API). This is particularly useful for large documents where sending the entire document for updates is inefficient.&lt;/P&gt;
&lt;P&gt;Using Partial Document Update from Databricks&lt;/P&gt;
&lt;P&gt;The standard Databricks to Cosmos DB connector with `mode("overwrite")` and `"Upsert": "true"` will replace the entire document, which is causing your issue. Instead, you can:&lt;/P&gt;
&lt;P&gt;1. Use the Cosmos DB SDK directly within Databricks to perform partial updates:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;from azure.cosmos import CosmosClient, PartitionKey&lt;/P&gt;
&lt;P&gt;Connect to your Cosmos DB account&lt;BR /&gt;client = CosmosClient(endpoint="&lt;A href="https://your-account.documents.azure.com:443/" target="_blank"&gt;https://your-account.documents.azure.com:443/&lt;/A&gt;", credential="your-key")&lt;BR /&gt;database = client.get_database_client("mydatabase")&lt;BR /&gt;container = database.get_container_client("mycontainer")&lt;/P&gt;
&lt;P&gt;Define patch operations&lt;BR /&gt;operations = [&lt;BR /&gt;{"op": "replace", "path": "/field1", "value": "new_value"},&lt;BR /&gt;{"op": "add", "path": "/newField", "value": 123}&lt;BR /&gt;]&lt;/P&gt;
&lt;P&gt;Apply patch operations&lt;BR /&gt;container.patch_item(item="document_id", partition_key="partition_key_value", patch_operations=operations)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;2. For batch processing, you can use Databricks to identify changed fields and then construct patch operations:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;# Process in batches&lt;BR /&gt;def patch_document(row):&lt;BR /&gt;doc_id = row["id"]&lt;BR /&gt;partition_key = row["partitionKey"]&lt;BR /&gt;&lt;BR /&gt;# Determine which fields changed&lt;BR /&gt;operations = []&lt;BR /&gt;for field in changed_fields:&lt;BR /&gt;operations.append({"op": "replace", "path": f"/{field}", "value": row[field]})&lt;BR /&gt;&lt;BR /&gt;# Apply patch&lt;BR /&gt;container.patch_item(item=doc_id, partition_key=partition_key, patch_operations=operations)&lt;/P&gt;
&lt;P&gt;# Apply to your dataframe&lt;BR /&gt;df.foreach(patch_document)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;## Benefits of Partial Updates&lt;/P&gt;
&lt;P&gt;1. Reduced network traffic - only sending changed fields&lt;BR /&gt;2. Lower RU consumption - partial updates typically cost less than full replacements&lt;BR /&gt;3. Avoids concurrency issues - no need to read-modify-write the entire document&lt;BR /&gt;4. Preserves existing fields - other properties remain untouched&lt;/P&gt;
&lt;P&gt;This approach is much more efficient than the current connector-based approach where the entire document is replaced, causing fields to be deleted when they're not included in the new document.&lt;/P&gt;
&lt;P&gt;For very large datasets, you might need to implement batching and error handling to manage the process efficiently.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 17:21:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/upsert-from-databricks-to-cosmosdb/m-p/114440#M44826</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-03T17:21:05Z</dc:date>
    </item>
  </channel>
</rss>

