Databricks Community

User16826994223 · ‎06-18-2021

How does Delta Sharing work?

Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets. There are two parties involved: Data Providers and Recipients.

As the Data Provider, Delta Sharing lets you share existing tables or parts thereof (e.g., specific table versions of partitions) stored on your cloud data lake in Delta Lake format. A Delta Lake table is essentially a collection of Parquet files, and it’s easy to wrap existing Parquet tables into Delta Lake if needed. The data provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. We’ve open sourced a reference sharing server; and we provide a hosted one on Databricks, as we imagine other vendors will.

As a Data Recipient, all you need is one of the many Delta Sharing clients that supports the protocol. We’ve released open source connectors for pandas, Apache Spark, Rust and Python, and we’re working with partners on many more.The actual exchange is carefully designed to be efficient by leveraging the functionality of cloud storage systems and Delta Lake. The protocol works as follows:

The recipient’s client authenticates to the sharing server (via a bearer token or other method) and asks to query a specific table. The client can also provide filters on the data (e.g. “country=US”) as a hint to read just a subset of the data.
The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in S3 or other cloud storage systems that actually make up the table.
To transfer the data, the server generates short-lived pre-signed URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel at massive bandwidth, without streaming through the sharing server. This powerful feature available in all the major clouds makes it fast, cheap and reliable to share very large datasets.