Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
skhaletski
Databricks Employee

The oil and gas industry invests over $600 billion annually in upstream activities, much of it supporting complex, proprietary models and simulation algorithms. Many organizations still depend on legacy C++ and C libraries for critical functions such as reservoir modeling and predictive maintenance, codebases often spanning millions of lines and decades of accumulated intellectual property (IP).

Modern data collaboration demands that these assets be shared across partners, vendors, and joint ventures to accelerate innovation and improve decision-making. Yet, IP protection and regulatory compliance remain major barriers. Companies must balance the need to collaborate securely with the imperative to safeguard proprietary code, data and ML assets.

Databricks Clean Rooms directly address this challenge, enabling organizations to collaborate, analyze, and share insights across boundaries without exposing underlying data or code. This approach bridges the gap between innovation and protection, allowing oil and gas enterprises to unlock the full value of their legacy assets in a governed, modern environment.

This blog post explores a hypothetical use case, examining the complexities and challenges of secure data and IP collaboration, and then details how Databricks Clean Rooms address these issues through their architecture and functionality, enabling secure, governed, and privacy-preserving collaboration that fosters innovation and drives business value.

 

Databricks Clean Rooms at a Glance

Databricks Clean Rooms provide a secure environment for multiple parties to collaborate on data and ML assets, enabling them to perform analysis and build models without revealing their sensitive data or proprietary algorithms to others. In essence, Clean Rooms provide a controlled and secure space where parties can collaborate effectively while maintaining complete control over their most valuable assets.

clean_rooms_secure_collaboration.png

Key characteristics and benefits of a Clean Room include:

  • Data Minimization: Only the necessary data for a specific analytical task is brought into the Clean Room, reducing the risk of over-sharing.
  • Cost Offloading: Databricks Clean Rooms use a collaborator-based cost model in which each library consumer acts as a Clean Room creator and incurs the primary platform costs. This shifts the cost burden away from the library (asset) owner to the collaborators or customers who create the Clean Rooms, allowing for clearer pricing models and predictable monetization.
  • Controlled Access and Permissions: Strict access controls and granular permissions are implemented to define who can access what data and what operations they can perform.
  • Auditability: All activities within the Clean Room are logged and auditable, providing transparency and accountability.
  • IP Protection: Collaborators retain full control of their data, ML assets, and Notebooks, sharing them with others only via controlled, privacy-preserving methods.

 

Use Case Overview

Context

A major oil and gas company possesses a highly specialized C++ library, meticulously crafted by their in-house subject matter experts. This proprietary library is designed for advanced data processing, addressing the unique complexities and demands of the O&G industry. Its sophisticated algorithms and tailored functionalities provide a significant competitive advantage in areas such as seismic data interpretation.

Problem Statement

  • The library cannot be released as open source due to stringent intellectual property regulations.
  • The company has made a strategic decision to maintain exclusive control over this valuable asset.
  • The specialized nature of the code, incorporating years of accumulated industry knowledge and proprietary techniques, makes its protection paramount.

Desired Outcomes

  • Commercialization of the existing library: license or otherwise make the invaluable technology accessible to other industry players.
  • Drive innovation and efficiency: leverage the existing robust and industry-specific solution to improve data processing across the energy landscape.


For the purposes of this blog, let’s define the structure of a hypothetical C++ library, which we will call OGSCiSeisLib. The library’s classes are outlined as follows:

use_case_lib_class_view.png

Code Preparation

Currently, the library exists only as C++ code, but Databricks Clean Rooms do not support direct C++ execution within their collaborative notebooks. To resolve this, Python bindings can be created for the C++ code, allowing it to be packaged as a Python wheel (.whl). Python bindings are wrapper libraries that allow code written in other programming languages, such as C++, to be called and used from Python. The resulting wheel can then be installed and used in Databricks Clean Rooms, supporting cross-team collaboration while preserving intellectual property rights.
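To make the binding idea concrete, the snippet below uses Python’s built-in ctypes to call a function from the standard C math library. Generated binding layers automate these same native-call mechanics for an entire C++ API; this is a minimal illustration of the principle, not an example from OGSCiSeisLib:

```python
import ctypes
import ctypes.util

# Load the standard C math library and declare sqrt's C signature,
# mimicking (in miniature) what generated bindings do for a C++ library.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(16.0))  # → 4.0, computed by native C code
```

The Python caller never sees the native implementation, only the declared interface, which is the same property that makes compiled bindings attractive for protecting proprietary logic.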

skhaletski_0-1760990002431.png

When working to unlock value from legacy proprietary algorithms written in C or C++, creating Python bindings is a foundational step for integrating these high-performance components into modern data science and analytics workflows. Several well-established tools have become indispensable for this purpose.

The most popular Python binding tools, SWIG, pybind11, and Cython, each support the creation of .whl (wheel) files, the standard format for distributing Python packages. This allows developers to easily package, distribute, and install their compiled libraries with pip across different operating systems.

  • SWIG: Enables packaging of C/C++ extensions as Python wheel files, making it possible to distribute compiled bindings through PyPI or internal indexes.
  • pybind11: Integrates seamlessly with Python’s setuptools to build wheel distributions, allowing efficient deployment of C++ modules as binary .whl files.
  • Cython: Designed for compiling Python extensions, Cython supports direct wheel generation for distributing binary Python extension modules.

All three solutions are well-documented for producing wheels (.whl), ensuring library consumers can install native code bindings with a single pip command on any supported platform.
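As a sketch of the pybind11 route, a build script for the hypothetical OGSCiSeisLib bindings might look like the following (the module name and source file names are assumptions for illustration; pybind11 must be installed, and `bindings.cpp` would expose the C++ classes via the `PYBIND11_MODULE` macro):

```python
# setup.py — hypothetical build script for the OGSCiSeisLib Python bindings.
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

setup(
    name="ogsciseislib",
    version="0.1.0",
    ext_modules=[
        Pybind11Extension(
            "ogsciseislib",
            ["bindings.cpp"],  # C++ file wrapping the library with PYBIND11_MODULE
            cxx_std=17,
        )
    ],
    cmdclass={"build_ext": build_ext},
)
```

Running `pip wheel .` in the project root would then compile the extension and produce the binary .whl file ready for upload.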

With this approach, users gain access to the robust capabilities of the C++ library from within the secure and collaborative Databricks Clean Rooms environment, sidestepping language compatibility concerns and ensuring sensitive logic remains protected.

 

Uploading and Installation

A common requirement in such environments is the ability to leverage external Python modules, often distributed as .whl (wheel) files, which are a standard format for Python package distribution.

To integrate an external Python package from a .whl file into a Databricks Clean Room, follow these steps:

  1. Upload the .whl File: The .whl file containing the desired Python modules must first be uploaded to a secure and accessible location within the Databricks environment. A recommended best practice for this is to utilize Databricks Volumes. Volumes offer a flexible and governed way to manage external data and files, ensuring that the .whl file is stored in a location that can be securely accessed by notebooks within a Databricks Clean Room.
  2. Install from Notebook: Create a notebook; the uploaded .whl file can then be installed using the pip package manager. The command will typically look like
    !pip install /Volumes/<volume_path>/<library_file>.whl
    where <volume_path> refers to the specific path within the Volume where the .whl file is stored. Executing this command within the notebook will install the package, making its functionalities available for use in the Clean Room.
    Note
    Any notebook references to tables, views, or volumes that were added to the clean room must use the catalog name assigned when the clean room was created (“creator” for data assets added by the clean room creator, and “collaborator” for data assets added by the invited collaborator). For example, a table added by the creator could be named creator.sales.california.

    Likewise, verify that the notebook uses any aliases assigned to data assets that were added to the clean room.

  3. Share assets with Collaborators: Share the Volume and other assets within the Clean Room, then configure permissions so each invited collaborator can securely access and run the shared notebook and any linked data assets, all within the governed environment of the Clean Room.
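Once installed, the wheel imports like any other Python package inside the Clean Room notebook. Because OGSCiSeisLib is hypothetical, the sketch below uses a pure-Python stand-in to show the calling pattern only; the class and method names are illustrative assumptions, not the real library’s API:

```python
# Pure-Python stand-in for the hypothetical compiled ogsciseislib module,
# used here only to illustrate how a collaborator's notebook would call it.
class SeismicTraceProcessor:
    """Illustrative stub: normalizes a seismic trace to its peak amplitude."""

    def __init__(self, sample_rate_hz: float):
        self.sample_rate_hz = sample_rate_hz

    def normalize(self, trace: list[float]) -> list[float]:
        peak = max(abs(v) for v in trace) or 1.0
        return [v / peak for v in trace]


# In a real Clean Room notebook this would instead be:
#   import ogsciseislib
#   proc = ogsciseislib.SeismicTraceProcessor(sample_rate_hz=500.0)
proc = SeismicTraceProcessor(sample_rate_hz=500.0)
print(proc.normalize([2.0, -4.0, 1.0]))  # → [0.5, -1.0, 0.25]
```

The collaborator only ever interacts with the installed package’s public interface; the compiled implementation inside the wheel stays opaque.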


clean_rooms_legacy_code_flow.png

 

Privacy-Centric Collaboration

It is crucial to understand the inherent security measures within Databricks Clean Rooms, especially when incorporating external code. A critical aspect of maintaining the integrity and security of Databricks Clean Rooms is the strict review process applied to all notebooks.

  • Code Review Process: Notebooks in a Databricks Clean Room undergo strict review by collaborators to ensure adherence to security and data-privacy requirements and to prevent access to or exposure of sensitive information.
  • Prevention of Unauthorized Code Access: A significant implication of this strict review process is that there is no possibility for unauthorized modification or inspection of the library code once it has been installed. The review process specifically prevents users from altering the installed library's code or attempting to decompile it to view its source. This ensures that the intellectual property embedded within the .whl file remains protected and that the Clean Room environment maintains its integrity.


By adhering to these steps and understanding the security protocols, organizations can effectively leverage external Python packages within Databricks Clean Rooms while maintaining intellectual property protection.

 

Potential Value and Benefit

Translating legacy proprietary code into Python bindings and making these available in a secure, collaborative clean room generates advantages:

  • Efficient re-use and monetization: Library owners can monetize their proprietary solutions either by licensing access or enabling value-sharing arrangements with collaborators and partners.
  • Expanded usage: Python bindings democratize access, empowering more analysts and data engineers to leverage sophisticated legacy models within interactive workloads.
  • Competitive differentiation: Sharing proprietary algorithms securely can cement a company's role as an industry innovator, amplifying influence while strictly governing how code is consumed, executed, and monetized.
  • Data privacy and security: Databricks Clean Rooms safeguard data and IP, letting code run across boundaries without code or data leakage – critical for the oil and gas industry, which must balance collaboration with IP protection.

 

Conclusion

For the oil and gas industry, integrating legacy code into modern data and AI platforms like Databricks can unlock decades of institutional knowledge for seismic analysis, reservoir modeling, and production optimization.

Python packaging and Databricks Clean Rooms streamline secure deployment, ensuring collaboration and innovation can happen without compromising proprietary solutions or competitive advantage. This approach provides oil and gas companies with the tools to preserve invaluable IP while driving new efficiencies and insights in a securely governed environment.

By leveraging Clean Rooms, IP owners maximize return and minimize risk: collaborators get straightforward, pay-as-you-go access, while the asset provider gains new revenue streams without absorbing the underlying platform costs. 

It is important to note that while Clean Rooms facilitate sharing of data, code, and ML assets, they do not obviate any legal obligations or compliance requirements pertaining to sharing information.

Contact your Databricks representative for a demo and discussion on transforming energy operations. Explore further industry-specific use cases to harness the power of Databricks.