SashankKotta, Databricks Employee

Introduction

In today's data-driven economy, organisations are increasingly recognising the value of their data assets. However, traditional methods of sharing data often involve costly replication, complex infrastructure setups, and security challenges. Databricks Delta Sharing offers a revolutionary approach to securely and cost-effectively monetise data assets, enabling seamless collaboration across platforms, clouds, and organisations.

The Power of Delta Sharing

Delta Sharing is an open-source protocol for secure data sharing, designed to eliminate the inefficiencies of traditional methods. It allows organisations to share live data directly from their cloud object stores without replication or the need for specialised computing environments. This approach significantly reduces costs while ensuring robust security and governance.


Key Features of Delta Sharing

  1. No Data Replication: Share data directly from existing cloud object stores, reducing storage costs.
  2. Cross-Platform Compatibility: Share data across clouds (Azure, AWS, GCP) and on-premises systems using open-source connectors like Pandas, Apache Spark™, Tableau, and Power BI.
  3. Centralised Governance: Manage permissions, track usage, and audit shared datasets through Unity Catalog.
  4. Real-Time Data Access: Enable live data sharing without delays or inconsistencies.
  5. Incremental Data Sharing with CDF: Delta Sharing supports Change Data Feed (CDF)-enabled objects, allowing incremental changes to be seamlessly delivered to downstream systems.
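
For example, before incremental changes can flow through a share, CDF must first be enabled on the source table. A minimal SQL sketch, assuming a hypothetical table name:

  -- Enable Change Data Feed on an existing Delta table (hypothetical name).
  ALTER TABLE main.sales.orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

Once CDF is enabled and the table is shared WITH HISTORY, recipients can consume row-level changes rather than full snapshots (see Use Case 2 below).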

Cost Optimisation Through Delta Sharing

Delta Sharing provides significant cost savings compared to traditional data-sharing methods:

  1. Reduced Storage Costs:
    1. With Delta Sharing, the recipient has the flexibility to access live data directly (without data replication) or replicate data as needed for their use case. Accessing data in live mode minimizes storage costs, as there is no requirement to maintain duplicate copies of the data.
  2. Eliminated Egress Fees:
    1. By choosing Cloudflare R2 object storage, you have the option to avoid egress fees. Replicating or migrating data that you share to R2 allows you to use Delta Sharing without incurring egress fees. Please note, however, that this benefit does not apply to view sharing, which may still result in egress costs.
  3. Lower Infrastructure Costs:
    1. With Delta Sharing, data providers are only responsible for storage costs: recipients access and process the shared data using their own compute resources, so the provider incurs no additional compute charges when consumers query the data. In contrast, when sharing data via views, compute costs are incurred on the provider’s side each time a consumer runs a query (see the view-sharing sketch after this list).
  4. Streamlined Operations:
    1. Centralised governance reduces administrative overheads and ensures compliance with regulatory requirements.
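
To make the view-sharing trade-off in point 3 concrete, a view is added to a share much like a table; the difference lies in where query compute runs. A minimal sketch with hypothetical share and view names:

  -- Add a view to an existing share (hypothetical names).
  -- Consumer queries against shared views run on the provider's compute.
  ALTER SHARE sales_share
  ADD VIEW main.reporting.daily_summary;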

Security and Governance

Delta Sharing ensures the secure sharing of data assets through robust security features:

  1. Encryption: End-to-end TLS encryption from client to server to storage account.
  2. Access Control: Role-based access control (RBAC) ensures that only authorised users can access shared datasets.
  3. Auditing & Monitoring: Comprehensive logging tracks all access and usage activities for compliance purposes.
  4. For detailed information, refer to the reference links in the Appendix.

Delta Sharing Use Cases

Below are some Delta Sharing use cases covering different scenarios.

Use Case 1

An analytics company, X, has two teams: the Data Engineering (DE) team and the Reporting team. Due to regional and cloud constraints, these teams operate in separate Unity Catalog metastores on different clouds: the DE team works in a metastore built on Azure, while some business users work in a metastore built on GCP for specific use cases. This setup has led to significant data duplication, as data is copied between metastores frequently. The organisation is now seeking an alternative that eliminates this redundancy. The Reporting team primarily works on dashboards created from streaming workloads ingested into the GCP metastore.

Solution:

Databricks-to-Databricks (D2D) Delta Sharing is an ideal solution for this scenario since both teams use Databricks. It enables secure and efficient sharing of data across Unity Catalog metastores on different clouds without requiring data replication.

Step-by-step guide for setting up D2D Delta Sharing for the streaming workload:

  1. For this use case, we will use a sample streaming table created as below. Here, spark is the notebook's SparkSession and volume_folder is assumed to point to a Unity Catalog volume path holding the raw JSON files.
    # Auto Loader stream that writes to a Delta table and restarts on schema evolution.
    def start_stream_restart():
      while True:
        try:
          q = (spark.readStream
                      .format("cloudFiles")  # Auto Loader source
                      .option("cloudFiles.format", "json")
                      .option("cloudFiles.schemaLocation", f"{volume_folder}/inferred_schema")
                      .option("cloudFiles.inferColumnTypes", "true")
                      .load(volume_folder + "/user_json")
                    .writeStream
                      .format("delta")
                      .option("checkpointLocation", volume_folder + "/checkpoint")
                      .option("mergeSchema", "true")
                      .table("autoloader_demo_output"))
          q.awaitTermination()
          return q
        except BaseException as e:
          # Auto Loader raises UnknownFieldException when new columns appear;
          # restart the stream in that case, otherwise re-raise.
          if 'UnknownFieldException' not in str(e):
            raise e

    start_stream_restart()
  2. Metastore-level permissions (e.g., CREATE SHARE and CREATE RECIPIENT) are required for sharing an object using Delta Sharing.
  3. Follow the steps below to create a Share and a Recipient and establish a connection between them.
    1. Create a Share: Use the query below to create a share. 
      CREATE SHARE IF NOT EXISTS sash_demo_share_to_ws
      COMMENT "Demo share"; 
      Once the share is successfully created, it will appear in the Delta Sharing section of Catalog Explorer.
    2. Add the Streaming Table to the Share: Add the streaming table created in the previous step to the share using the query below. The WITH HISTORY clause shares the table history, allowing recipients to perform time travel queries or read the table with Spark Structured Streaming. For Databricks-to-Databricks shares, the table’s Delta log is also shared to improve performance. History sharing requires Databricks Runtime 12.2 LTS or above.
      ALTER SHARE sash_demo_share_to_ws
      ADD TABLE main.dbdemos_autoloader.autoloader_demo_output
      COMMENT "Add streaming table to share"
      WITH HISTORY;
      You can verify the table's addition to the share.
    3. Create a Recipient: To create a recipient, obtain the recipient sharing identifier from the recipient workspace by executing the specified query. This identifier will follow the format <cloud>:<region>:<uuid>.
      SELECT current_metastore();
    4. Set Up the Recipient: Use the given query to create the recipient.
      CREATE RECIPIENT IF NOT EXISTS sash_demo_recipient_ne_ws
      USING ID '<cloud>:<region>:<uuid>'
      COMMENT "recipient: workspace";
      Confirm that the recipient has been successfully created by reviewing it in the Delta Sharing section.
    5. Grant Recipient Access to the Share: Grant access to the recipient using the provided query.
      GRANT SELECT ON SHARE sash_demo_share_to_ws TO RECIPIENT sash_demo_recipient_ne_ws;
      Verify that the recipient now has access to the share.
    6. Create a Delta Share Catalog on Recipient Workspace: On the recipient workspace, set up a Delta share catalog using the Provider’s sharing identifier.
      CREATE CATALOG IF NOT EXISTS sash_demo_delta_share_catalog
      USING SHARE `<cloud>:<region>:<uuid>`.sash_demo_share_to_ws;
    7. Access the Streaming Table: The streaming table is now available in the recipient workspace as a Delta Sharing object and can be queried directly.
  4. A dashboard can now be built on the Delta Sharing streaming table in the recipient metastore on GCP. Note: If you encounter performance issues with your dashboards, consider incrementally replicating only the specific tables causing the problems and building the dashboard on top of those replicated tables. A provider-side verification sketch follows these steps.
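
Before handing off to the recipient, the setup can be verified from the provider side. A quick sketch using the share and recipient names from this walkthrough:

  -- List the objects included in the share.
  SHOW ALL IN SHARE sash_demo_share_to_ws;

  -- Confirm which recipients have been granted access to the share.
  SHOW GRANTS ON SHARE sash_demo_share_to_ws;

  -- Inspect the recipient's details (sharing identifier, activation status, etc.).
  DESCRIBE RECIPIENT sash_demo_recipient_ne_ws;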

Use Case 2

The same analytics company X also has a dataset in Metastore 1, built on the Azure cloud, and wants to incrementally load this dataset as a feed into Metastore 2, built on the GCP cloud.

Solution:

Instead of sending a daily feed file with the incremental load, Databricks-to-Databricks (D2D) Delta Sharing with CDF enabled is a perfect fit for this use case.

Step-by-step guide for setting up D2D Delta Sharing for incremental load:

  1. We shall use the sample table below, with CDF enabled, from a workspace linked to the Azure metastore.
  2. Repeat Steps 2 and 3 from Use Case 1 for setting up the Share (Azure metastore) and Recipient (GCP metastore) for this table.
  3. Access the table from the Delta Sharing catalog as a recipient in the GCP metastore and load the changes downstream incrementally; a query sketch follows these steps. The shared table is read for incremental updates/inserts from a workspace in the GCP metastore.
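
As a sketch of the recipient-side incremental read, assuming a hypothetical schema and table name inside the shared catalog, and that the table was shared WITH HISTORY with CDF enabled at the source:

  -- Read changes since a given table version via Change Data Feed.
  SELECT *
  FROM table_changes('sash_demo_delta_share_catalog.dbdemos.cdf_table', 5)
  WHERE _change_type IN ('insert', 'update_postimage');

The result set can then be merged into the downstream table; each subsequent run starts from the version after the last one processed.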

Manage Your Delta Shares Using System Tables

The following system tables offer key insights into Delta shares, recipients, tables/schemas included in the shares, and the mapping between shares and recipients:

  1. SYSTEM.INFORMATION_SCHEMA.SHARES: Provides essential details about the Delta shares that have been created.
  2. SYSTEM.INFORMATION_SCHEMA.TABLE_SHARE_USAGE: Contains comprehensive information about the tables used within the Delta shares.
  3. SYSTEM.INFORMATION_SCHEMA.SHARE_RECIPIENT_PRIVILEGES: Displays mapping details between Delta shares and their respective recipients.
  4. SYSTEM.INFORMATION_SCHEMA.SCHEMA_SHARE_USAGE: Offers information about the schemas included in the Delta shares.
  5. SYSTEM.INFORMATION_SCHEMA.RECIPIENTS: Contains all relevant details about recipients.
  6. SYSTEM.INFORMATION_SCHEMA.RECIPIENT_ALLOWED_IP_RANGES: Primarily used to check the allowed IP ranges configured for a recipient.
  7. SYSTEM.INFORMATION_SCHEMA.RECIPIENT_TOKENS: Stores data related to configuration tokens generated for authentication purposes, which is especially useful for open (D2O) sharing.
  8. SYSTEM.ACCESS.AUDIT: Audit logs enable organisations to track and monitor all Delta Sharing activities, providing detailed records of who accessed, shared, or managed data assets for compliance, security, and usage analysis.
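
For example, the mapping between shares and recipients, and recent sharing activity, can be pulled with queries along these lines. Column names such as share_name, grantee, and privilege_type are assumed from the information-schema conventions, and the deltaSharing action-name prefix is assumed for audit events:

  -- Which recipients hold which privileges on which shares?
  SELECT s.share_name, p.grantee, p.privilege_type
  FROM system.information_schema.shares AS s
  JOIN system.information_schema.share_recipient_privileges AS p
    ON s.share_name = p.share_name;

  -- Recent Delta Sharing audit events.
  SELECT event_time, user_identity.email, action_name
  FROM system.access.audit
  WHERE action_name ILIKE 'deltaSharing%'
  ORDER BY event_time DESC
  LIMIT 100;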

Best practices

  1. Restrict Access with Fine-Grained Permissions: Grant access only to specific tables, views, or columns based on the recipient's needs. Avoid sharing unnecessary data.
  2. Audit Data Access: Regularly monitor and audit who accessed the shared data using Unity Catalog logs.
  3. Optimize Data for Delta Sharing:
    1. Use Managed Delta Tables: Delta Sharing supports managed Delta tables, external Delta tables, and views. Use managed tables for better performance and governance.
    2. Leverage Liquid Clustering for optimized performance.
  4. Manage Recipients:
    1. Revoke Access When Needed: Regularly review and revoke access for recipients who no longer need it (see the SQL sketch after this list).
  5. Automate Processes:
    1. Automate Share Creation and Updates: Use APIs or scripts to automate the creation of shares and recipients, especially in large-scale environments.
    2. Integrate with CI/CD Pipelines: Automate the management of Delta Sharing as part of your CI/CD workflows for better consistency.
  6. Use Incremental Updates:
    1. Leverage Change Data Feed (CDF): Use the Delta Lake Change Data Feed (CDF) feature to share only incremental updates instead of re-sharing entire datasets.
  7. Ensure Compliance:
    1. Verify Regulatory Compliance: Ensure that shared datasets comply with relevant regulations (e.g., GDPR, HIPAA) before sharing externally.
    2. Use Privacy-Safe Data Clean Rooms: For sensitive collaborations, consider using privacy-safe environments where raw data isn't exposed.
  8. Maintain Clean Shares:
    1. Review Shares Periodically: Regularly review active shares and recipients to ensure they are still necessary and relevant.
    2. Remove Expired Shares: Revoke access for recipients who no longer need it or when a share is no longer valid.
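
For the recipient-management and clean-up practices above, access can be revoked and stale objects removed with standard SQL; a sketch reusing the names from Use Case 1:

  -- Revoke a recipient's access to a share.
  REVOKE SELECT ON SHARE sash_demo_share_to_ws FROM RECIPIENT sash_demo_recipient_ne_ws;

  -- Drop the recipient and the share once they are no longer needed.
  DROP RECIPIENT IF EXISTS sash_demo_recipient_ne_ws;
  DROP SHARE IF EXISTS sash_demo_share_to_ws;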

Common mistakes

  1. Incorrect Permissions Setup: Granting excessive or insufficient permissions at the metastore, catalog, schema, or table level. Recipients may access unauthorized data or fail to access the intended data.
  2. Sharing Unsupported Objects: Attempting to share objects that are not supported by Delta Sharing, such as non-Delta tables or Delta-shared views.
  3. Misconfiguration of Recipients: Using incorrect recipient sharing identifiers or failing to verify recipient configurations.
  4. Schema Changes Without Notification: Making schema changes (e.g., adding/removing columns) without notifying recipients breaks recipient workflows and dashboards relying on the shared schema.
  5. Not Using Incremental Updates: Re-sharing entire datasets instead of using incremental updates with Change Data Feed (CDF). This results in increased costs and inefficiencies due to redundant data sharing.
  6. Overlooking Cost Implications: Ignoring egress costs when sharing data across regions or clouds, which can lead to unexpectedly high costs for sharing large datasets over long distances.
  7. Mismanagement of Expired Shares: Forgetting to revoke access for recipients who no longer need it, or when a share is no longer valid. This can lead to unauthorized access to outdated or sensitive data.

Conclusion

Delta Sharing provides a revolutionary approach to secure, efficient, and cost-effective data sharing across platforms, clouds, and organizations. By eliminating the need for data replication and leveraging features like Change Data Feed (CDF) for incremental updates, it addresses common challenges such as data duplication, high costs, and complex infrastructure setups. With robust governance through Unity Catalog, fine-grained access control, and compatibility with popular tools, Delta Sharing empowers organizations to streamline operations, enhance collaboration, and monetize their data assets effectively. Following best practices ensures optimal performance, security, and compliance while avoiding common pitfalls in data-sharing workflows.

Appendix:

Reference links:

https://docs.databricks.com/aws/en/delta-sharing/

https://docs.databricks.com/aws/en/delta-sharing/share-data-databricks

https://docs.databricks.com/aws/en/delta-sharing/manage-egress#use-cloudflare-r2-replicas-or-migrate... 

https://www.databricks.com/blog/announcing-public-preview-streaming-table-and-materialized-view-shar...
