LindaWeimann, Databricks Employee



In highly regulated industries such as the German banking sector, where rules are stringent and legacy systems abound, sharing data products can be a formidable hurdle. Overcoming these obstacles, however, promises new possibilities for collaboration, innovation, and value creation, ushering in a new era of data-driven services.

In this article, we will dive deeper into how you can:

  1. Ensure compliance with regulatory mandates while enhancing transparency and accountability in data-sharing practices using Delta Sharing.
  2. Scale infrastructure setup for hundreds of customers using Terraform to further enhance the efficiency and reliability of the data distribution process.

We will do so by looking at the example of German banks' closed service provider setup.

Join us as we navigate the challenges of data distribution in German banks and uncover how Delta Sharing can create a more streamlined and efficient approach to sharing data products. Let's dive in and explore the possibilities.


Quick side note: 
 A 'closed service provider' refers to a financial institution in Germany that offers banking services exclusively to its members or to specific groups, typically employees of certain companies or members of certain organizations, with a focus on tailored financial products and services. These closed service providers maintain their customers' data estates and need to find ways to live up to their customers' expectations when it comes to data product distribution.


Requirements

Much like other organizations in the Financial Services Industry (FSI), German banks face challenges and special requirements regarding data product distribution and cloud architecture, primarily due to strict regulations and security concerns. Here are some of the key considerations:

  1. Data Privacy Regulations (e.g., GDPR): German banks must strictly comply with data privacy regulations like GDPR, ensuring encryption and secure access controls for personal data.
  2. Data Residency and Sovereignty: To maintain data sovereignty, German banks must ensure sensitive financial data remains within the country's borders, often necessitating data residency policies when adopting cloud solutions.
  3. Security Standards: Given the sensitivity of financial data, German banks must meet rigorous security standards, including encryption, multi-factor authentication, and continuous cloud infrastructure monitoring.
  4. Audit and Compliance Requirements: German banks are subject to regular audits and compliance assessments by regulatory authorities like BaFin, necessitating detailed records of data processing activities, including those related to cloud services.

To enable German banks' closed service providers to modernize their data product distribution, we must provide a system with:

  • Fine-grained access control ensuring that only authorized users can access specific data, enhancing security and compliance.
  • Access and audit logs offering detailed records of who accessed what data and when, improving transparency and accountability.
  • Detailed data lineage to meet compliance requirements and prove data flows.

Conventionally, without a system capable of distributing datasets, decision-makers would be bound to make strategic decisions based on duplicated snapshots of data with potentially outdated information. Working on outdated datasets not only compromises decision quality but also significantly hinders the implementation of new use cases (e.g., next best action), because modern, high-value use cases require fresh and timely data.

However, one advantage of the current setup is that one closed service provider handles the data for multiple banks, and the shape of these shared datasets is fundamentally the same across the banks it services. This means you could securely distribute data products by slicing and sharing the right data with the right recipient, offering solutions with greater scalability and efficiency.

 


So, how can we harness this potential?

Enter from stage left: Delta Sharing

Delta Sharing is an open protocol designed to facilitate the secure and efficient sharing of large datasets across and between organizations. There are many articles out there (like this one and this one and even this book!) that detail the inner workings and the advantages of Delta Sharing, so I will hold back and discuss only the features relevant to our scenario:

  • Fine-Grained Access Control: Delta Sharing, especially when you share to another Databricks workspace rather than into the open, enables data owners to define granular access controls through Unity Catalog, specifying who can access their datasets and what actions they can perform. This ensures that sensitive data is only accessible to authorized individuals or entities, supporting data security and compliance with regulatory requirements.
  • Access Logs: Delta Sharing includes built-in capabilities for logging access to shared datasets. This feature allows data owners to track which recipients accessed their data, when they accessed it, and what actions they performed. Access logs are crucial in auditing and compliance efforts, providing transparency and accountability in data-sharing practices.
  • Fine-Grained Lineage: Through Unity Catalog, Delta Sharing provides detailed lineage tracking, allowing data owners to trace the origin and transformations of their datasets. This capability is essential for meeting compliance requirements, as it enables organizations to demonstrate the integrity and provenance of their data. With fine-grained lineage, German banks can respond to regulatory inquiries effectively and provide auditors with clear evidence of data flows and transformations.
  • Encryption: In Delta Sharing, data is end-to-end encrypted leveraging cloud-native encryption technologies using the AES-256-GCM algorithm, a widely used and secure encryption standard. The data is encrypted on the client side before it is shared and remains encrypted while it is being transmitted and stored. The data provider manages the encryption keys and can grant and revoke access to the shared data as needed.

Besides ticking all the regulatory boxes, Delta Sharing, with features like dynamic data updates and recipient-based partition filtering, allows us to build a simple setup for a complex problem.

Solution Architecture

Now, we have talked a lot, but we haven’t shown what the solution looks like. So, how can a closed service provider serve hundreds of banks with data products? The basis for this is the Lakehouse Architecture. It’s fundamental to the proposed solution, and if you are not familiar with it, I’d recommend reading up on it here.

Once you have your Lakehouse Architecture in place, the last step before sharing the data is to prepare and create the data products. Your data products should be delivered in a form that is convenient for your customers, which usually means aggregating your raw data. In addition to highly aggregated data, we recommend bundling lower levels of aggregation and raw data into data products, enabling your customers to dig deeper and understand where different KPIs come from.
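To make this concrete, here is a minimal, hypothetical sketch of such a bundled data product as a Delta Sharing share, defined with the Databricks Terraform provider (which we return to later in this article). All catalog, schema, and table names are made up; the later sketches build on this provider configuration.

```hcl
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Hypothetical data product: highly aggregated KPIs bundled with the
# lower-level table they are derived from, so customers can drill down.
resource "databricks_share" "kpi_data_product" {
  name = "bank_kpi_product"

  object {
    name             = "prod.data_products.kpis_monthly"
    data_object_type = "TABLE"
    comment          = "Highly aggregated KPIs"
  }

  object {
    name             = "prod.data_products.transactions_daily"
    data_object_type = "TABLE"
    comment          = "Lower-level aggregation for drill-down analysis"
  }
}
```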

And finally comes the sharing step! Once your data is organized and ready to be shared in your catalogs, we can utilize Delta Sharing to make that product available to the appropriate users.

How does it work?

Delta Sharing manages access control by allowing data providers to grant and revoke access to shared data by specifying permissions for each recipient. Through unique URLs with encrypted tokens, the recipient can then read the data from your storage and integrate it into their Unity Catalog using their own compute.
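A minimal sketch of this grant flow with the Databricks Terraform provider, assuming the recipient is also on Databricks (Databricks-to-Databricks sharing); the recipient name and metastore ID are hypothetical. For open, token-based sharing you would instead set the authentication type to "TOKEN" and hand the recipient an activation link.

```hcl
# Hypothetical recipient on another Databricks account, identified by
# the global metastore ID of their Unity Catalog metastore.
resource "databricks_recipient" "example_bank" {
  name                               = "example-bank"
  authentication_type                = "DATABRICKS"
  data_recipient_global_metastore_id = "azure:westeurope:<metastore-uuid>"
}

# Grant read access to the share; removing this grant revokes access.
resource "databricks_grants" "kpi_product_access" {
  share = databricks_share.kpi_data_product.name

  grant {
    principal  = databricks_recipient.example_bank.name
    privileges = ["SELECT"]
  }
}
```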

The beauty of this is that since the recipient, i.e., the customer, uses their own compute to analyze the data, you are not charged for data being read (except for cloud egress costs), which makes billing much more straightforward: you can price access to your data products upfront without factoring in the costs caused by their use.

Delta Sharing is an open protocol, and many clients can read shared data. If you want to go the extra mile regarding access control and audit logging, you can provide your customers with their own Unity Catalog and Databricks workspace. Delta Sharing is safe even without that: external clients can only access specific files through the Delta Sharing server when granted access. However, Unity Catalog can integrate data shared via Delta Sharing as if it were sitting in the catalog. That means you or your customer can then apply access controls, get audit logging and data lineage, and enjoy all the other benefits we discussed for the data product you provide.
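On the recipient's side, mounting the share into Unity Catalog can also be expressed in Terraform. This is a hedged sketch, assuming the databricks_catalog resource's provider_name and share_name arguments; the provider and share names are hypothetical.

```hcl
# Recipient side: mount the share as a regular catalog so that access
# controls, audit logging, and lineage apply to it like any other
# catalog in the recipient's metastore.
resource "databricks_catalog" "kpi_products" {
  name          = "kpi_products"
  provider_name = "closed_service_provider"
  share_name    = "bank_kpi_product"
  comment       = "Data product received via Delta Sharing"
}
```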

Note that data lineage is not shared from the provider to the consumer, but combining both sides will provide you with end-to-end lineage.

Delta Sharing grants access to the files in your cloud storage. 


Does that mean you have to partition your data to fit your data products and customers? Not necessarily.

If your customer is using their own machine to request the data, then yes: you cannot allow them to access files that include data not meant to be shared with them. But you wouldn't have to care about the underlying file structure if your customer used a machine outside their direct control that filters the data for you. Now, I said your customer using their own machine was a great benefit in terms of easing the billing process. Wouldn't a machine outside your customer's direct control undo this benefit? How can we solve this?

We can use a machine associated with your customer but outside their direct control, namely serverless compute that is still associated with your customer's workspace.


 

Doing so allows you to define data products that don't map directly onto your underlying files, which grants you great flexibility.

Within dynamic views, you can even set row- and column-level permissions based on the attributes of the recipient requesting the data. This feature is in public preview and will greatly enhance the sharing experience.

Note that this feature requires your region to support serverless compute.
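Here is a hedged sketch of both patterns, building on the earlier provider configuration. All object names and the "bank_id" recipient property are hypothetical, and I am assuming that the share object's partition specification accepts recipient properties (as exposed by the Delta Sharing API) and that views can be added as share objects.

```hcl
resource "databricks_share" "per_bank_slices" {
  name = "per_bank_slices"

  # Option 1: expose only the partitions matching a property set on the
  # requesting recipient (a hypothetical "bank_id" recipient property).
  object {
    name             = "prod.data_products.transactions_daily"
    data_object_type = "TABLE"

    partition {
      value {
        name                   = "bank_id"
        op                     = "EQUAL"
        recipient_property_key = "bank_id"
      }
    }
  }

  # Option 2: share a dynamic view whose SQL body filters rows based on
  # the recipient's attributes; sharing views needs serverless compute.
  object {
    name             = "prod.data_products.kpis_per_bank_v"
    data_object_type = "VIEW"
  }
}
```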

Automate with Terraform

At this point, we have the basic setup and can share data products with our customers. But we are talking about more than one customer here. Closed service providers in Germany do this at scale and share data with hundreds of banks. Not only do they provide data products, but they also need to provide the underlying infrastructure. So, in this setup, they need to provision a workspace per customer, handle access management, and provide data at scale, and that is not a task to be done manually hundreds of times. This is a task you want to automate: enter Terraform.

Scaling infrastructure setup for hundreds of customers using Terraform offers numerous benefits for organizations aiming to streamline their data distribution processes. With the Databricks Terraform provider, managing resources within Databricks becomes incredibly efficient. From notebooks to clusters, policies to users/groups, and even the Unity Catalog, Terraform provides a unified platform to oversee and manage every aspect of the data distribution infrastructure.

One of the critical advantages of Terraform lies in its modular approach. Organizations can quickly scale infrastructure setups for multiple customers by utilizing modules and variables. Modules enable the packaging of infrastructure into reusable components, facilitating consistent deployment across environments or workspaces. Variables, in turn, allow for the customization of these modules, ensuring that each customer's infrastructure setup meets their unique requirements.
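For illustration, a hypothetical root module that stamps out one such setup per bank from a reusable local module and a single variable map might look like this:

```hcl
# Hypothetical map of customers; each entry parameterizes one setup.
variable "banks" {
  type = map(object({
    metastore_id = string # recipient's global metastore ID
    bank_id      = string # value for recipient-based filtering
  }))
}

# Reusable module (the local path is hypothetical) bundling the share,
# recipient, and grant resources shown above; one instance per customer.
module "bank_data_product" {
  source   = "./modules/bank-data-product"
  for_each = var.banks

  bank_name    = each.key
  metastore_id = each.value.metastore_id
  bank_id      = each.value.bank_id
}
```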

Moreover, automation plays a pivotal role in simplifying the setup process. With the Databricks Terraform provider, organizations can automate the configuration of the Unity Catalog, including setting up a metastore, defining storage for the metastore, configuring external storage, and managing all related access credentials. This level of automation accelerates the deployment process and enhances consistency and reliability across infrastructure setups, ultimately empowering organizations to efficiently manage data distribution for hundreds of customers.
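As a hedged sketch of that Unity Catalog automation on Azure, with hypothetical resource names, storage accounts, and input variables:

```hcl
# Unity Catalog metastore with its root storage location.
resource "databricks_metastore" "this" {
  name          = "primary"
  region        = "westeurope"
  storage_root  = "abfss://metastore@stdataproducts.dfs.core.windows.net/"
  force_destroy = false
}

# Attach a workspace to the metastore.
resource "databricks_metastore_assignment" "ws" {
  metastore_id = databricks_metastore.this.id
  workspace_id = var.workspace_id # hypothetical input variable
}

# Credential used by Unity Catalog to reach external storage.
resource "databricks_storage_credential" "ext" {
  name = "external-data"
  azure_managed_identity {
    access_connector_id = var.access_connector_id # hypothetical
  }
}

# External location backed by that credential.
resource "databricks_external_location" "raw" {
  name            = "raw-data"
  url             = "abfss://raw@stdataproducts.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.ext.name
}
```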

Illimity Bank's use of Terraform led to significant time savings by automating permission-granting processes, reducing system complexity, and improving efficiency. Moreover, they were able to enhance system stability through the reduction of manual errors and the implementation of an automated disaster recovery solution. Why should other regulated companies not benefit from this technology?

Conclusion

In conclusion, transforming data product distribution within German banks using Delta Sharing represents a pivotal shift toward efficiency, compliance, and scalability. By addressing the unique challenges of stringent regulations and security concerns, Delta Sharing offers a comprehensive solution that empowers German banks' closed service providers to modernize their data distribution practices.

The future of the banking sector is set to be significantly transformed by technological advancements. Key trends include the rise of Generative AI, which will add complexity but also unlock new possibilities for data analysis and customer service. Open Banking is also spreading globally, enabling a new level of collaboration and data sharing among financial institutions. Furthermore, banks are investing heavily in data modernization, with solutions like the Databricks Data Intelligence Platform emerging to provide on-demand insights. By embracing Delta Sharing and leveraging Terraform for infrastructure setup, German banks are well-positioned to unlock new possibilities for collaboration, innovation, and value creation, ushering in a new era of data-driven services in the banking sector.

More resources

To learn more about Delta Sharing, watch how to use Delta Sharing on Databricks or try Delta Sharing yourself.

Databricks also provides a set of Terraform templates to deploy a data lakehouse, specifically defined for Financial Services, that incorporate best practices and patterns from over 600 FS customers. They are tailored for key security and compliance policies and provide FS-specific libraries as quickstarts for key use cases, including Regulatory Reporting and Post-Trade Analytics. This allows you to be up and running in a matter of minutes vs. weeks or months.