Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Sharing Approach for Secure Data Access in Development Environment

Phani1
Valued Contributor II

Hi Team,

We have the following scenario.

Problem Statement:  The customer currently has data in both production and stage environments, with the stage environment being used primarily for development and bug fixing activities. They now want to separate these environments based on workload type - creating a dedicated development environment for all development activities, while reserving the stage environment for testing and production bug fixes. However, the development environment currently lacks data, and the customer is concerned about copying data from stage/production to development due to the presence of sensitive information such as PII data.

Solution Approach:

We are considering using Delta Sharing to share only the required tables from the stage/production environments to the lower environments, with access granted only to authorized personnel. On top of these Delta Shares, we will create views with the same names as their source tables; these views will call UDFs containing logic to protect sensitive data through multiple layers of encryption. If required, we will also create Delta tables based on these views. This way we can share sensitive data more securely.
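A minimal sketch of the view layer described above, under stated assumptions: the shared catalog `prod_share`, the dev catalog `dev`, and the masking UDF `dev.security.mask_pii` are hypothetical names, and the UDF is assumed to exist already. The helper only builds the `CREATE OR REPLACE VIEW` statement as a string; executing it in a workspace is left out.

```python
def build_masked_view_sql(source_table, target_view, columns, pii_columns,
                          mask_udf="dev.security.mask_pii"):
    """Return a CREATE VIEW statement that wraps PII columns in a masking UDF.

    All table, view, and UDF names are illustrative assumptions.
    """
    unknown = set(pii_columns) - set(columns)
    if unknown:
        raise ValueError(f"PII columns not in column list: {sorted(unknown)}")

    select_items = []
    for col in columns:
        if col in pii_columns:
            # Sensitive column: route it through the (hypothetical) masking UDF
            # and keep the original column name.
            select_items.append(f"{mask_udf}(`{col}`) AS `{col}`")
        else:
            # Non-sensitive column: pass through unchanged.
            select_items.append(f"`{col}`")

    select_list = ",\n  ".join(select_items)
    return (
        f"CREATE OR REPLACE VIEW {target_view} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source_table}"
    )


# The view keeps the same schema and table name as its source,
# but lives in the dev catalog.
sql = build_masked_view_sql(
    source_table="prod_share.sales.customers",
    target_view="dev.sales.customers",
    columns=["customer_id", "email", "country"],
    pii_columns={"email"},
)
print(sql)
```

Keeping the list of PII columns explicit per table makes it easy to review what gets masked before a view is created.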
 
Kindly let me know whether this approach is suitable, and please suggest any alternative approaches we should consider.
 
Regards,
Phani
1 REPLY

loui_wentzel
Contributor

Hey Phani!

Cool setup you have there - some comments and ideas:

  • Generally it sounds like you have a good approach. Setting up a dedicated dev environment apart from staging and prod is the way to go. However, restricting access to tables in dev is generally not a viable option, as this is the environment where everyone should have the most freedom. I'd strongly suggest focusing on making the data shareable rather than locking it down.
  • Are the different workspaces within the same region? This makes it a lot easier, as you actually won't need Delta Sharing and can share by simply granting access in Unity Catalog; see this image and its documentation.
  • Regarding PII sensitivity: you should only need a subset of the data (both in number of tables and in table size), so you should know which tables you want to push to dev. Besides UDFs, there are some cool tools to help you mask or filter out PII data, including column filters based on RBAC roles, which should carry over if you do Databricks-to-Databricks Delta Sharing, as well as the upcoming policy tags shown at DAIS here, although this is still in beta.
  • I'd generally advise against setting up automatic pipelines and runs that push data to dev, as this increases the chance of pushing unwanted data to dev, and in larger amounts than needed.
  • An extra check when you invoke your pipelines could be to look for mail signatures and PII values (some simple checks for full names or values that resemble sensitive IDs, e.g. social security numbers or similar). These should be soft checks, but something that the person pushing needs to verify before finalizing the push to dev.
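The soft check from the last point could look roughly like the sketch below. It is plain Python over a small sample of rows, and the two regex patterns (e-mail addresses and US-style social security numbers) are illustrative assumptions only; real data would need patterns tuned to its own locale and ID formats. Findings are meant as warnings for the person pushing to review, not as hard failures.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive PII detector).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def soft_pii_check(rows):
    """Return a list of (row_index, column, pattern_name) findings.

    `rows` is a list of dicts, e.g. a small sample of the data to be pushed.
    Only string values are scanned.
    """
    findings = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for name, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    findings.append((i, column, name))
    return findings


sample = [
    {"id": 1, "note": "contact jane.doe@example.com"},
    {"id": 2, "note": "ref 123-45-6789"},
    {"id": 3, "note": "nothing sensitive"},
]
findings = soft_pii_check(sample)
for finding in findings:
    print(finding)
```

Running the check on a sample rather than the full table keeps it cheap, which fits its role as a manual sign-off step before the push.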

I hope these thoughts help 🙂

 

Best regards,

Loui