
Version 1.1: Data Isolation and Governance within PySpark DataFrames

joseph_in_sf
New Contributor III

See the comments below for a runnable notebook

Throughout my career I have worked at several companies that handle sensitive data, including PII, PHI, EMR, HIPAA-covered records, and Class I/II/III FOMC - Internal (FR) material. One employer even required a Department of Justice background check. Engineers handling sensitive data must consider how it may be exposed and develop safeguards to minimize risk. Protecting PII (personally identifiable information) is critical: the number of data breaches, and of records with sensitive information exposed, is trending upward every day. Do you really want your private medical details exposed to the internet, or to be partially liable for that exposure because of bad coding and practices? HIPAA requires that data be encrypted "at rest" and "in motion"; in the author's opinion, these requirements are the bare minimum and often insufficient. In a typical Databricks pipeline, that often means the cloud storage bucket has default encryption turned on and that's about it.

It is the author's opinion that truly ethically compliant pipelines have one more requirement: data should be encrypted when "at rest", "in motion", and "not actively being transformed". In other words, if a Databricks notebook (hereinafter "pipeline") loads patients' data and performs its necessary transformations, but none of those transformations touch a patient's first or last name, then at no point during that pipeline's lifetime should either of those fields be decrypted.
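As a rough sketch of that principle (the table and column names below are illustrative only; the actual mechanics are built out later in this article):

# first_name / last_name are already ciphertext in the source table and stay that way
patients = spark.read.table("raw.patients")

# The transformation only touches non-PII columns, so nothing is ever decrypted
visits_per_patient = patients.groupBy("patient_id").count()

visits_per_patient.write.mode("overwrite").saveAsTable("gold.visits_per_patient")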

 

What is Data Governance

Data governance is a framework of policies, processes, roles, and standards that ensures an organization's data is secure, accurate, available, and usable throughout its entire lifecycle. It guides how data is collected, stored, accessed, and disposed of to meet business goals and comply with regulations. It acts as a central system, defining who can do what with data, ensuring data quality, preventing inconsistencies, and building trust in data assets for better decision-making.

 

What is Schema Governance

Ditto, but with Spark schemas. Scala's Spark implementation has a class called Dataset[T] that implements dataset schema governance through a combination of robust type-safety mechanisms such as case classes and Scala's strong type system. Using Dataset[T] in Scala, it is super easy to enforce strict or non-strict compliance as interfaces on methods.

Example Spark StructType defined in JSON

{
 "fields": [
  {
   "name": "ID",
   "nullable": false,
   "type": "string",
   "metadata": {"PK": true, "comment": "PK", "validation": {"regex_match": "\\d{10}"}}
  },
  {
   "name": "FirstName",
   "nullable": false,
   "type": "string",
   "metadata": {"PII": true}
  },
  {
   "name": "SSN",
   "nullable": false,
   "type": "string",
   "metadata": {"PII": true}
  }],
 "type": "struct"
}
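For reference, this JSON definition can be parsed straight into a live StructType (a minimal sketch; the file path and storage location below are assumptions):

import json
from pyspark.sql.types import StructType

# Parse the JSON schema definition shown above
with open("patient_schema.json") as f:  # hypothetical path
    s_json = json.load(f)

patient_struct = StructType.fromJson(s_json)

# Apply the schema when reading raw data
patients_df = spark.read.schema(patient_struct).json("/mnt/raw/patients/")  # hypothetical location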

 

Example of Dataset[T] via a case class generated from a StructType

def limitDatasetToPersonName(peopleDS: Dataset[Person]): Dataset[PersonPHI] = {
  ??? // šŸ’ŖšŸ’ŖšŸ’Ŗ the type signature alone enforces the contract
}

def transformData(sourceDS: Dataset[SourceData]): Dataset[TargetData] = {
  ??? // šŸ’ŖšŸ’ŖšŸ’Ŗ
}

Unfortunately, neither Python nor PySpark has these superpowers; we will have to implement compliance in PySpark ourselves later.

 

What is Data Isolation

Data isolation is the practice of separating data sets to prevent unauthorized access, limit the impact of breaches, and maintain system stability, ensuring one user's or application's data remains invisible and inaccessible to others, even on shared infrastructure like the cloud. It's achieved through logical controls (access rules, authentication) or physical separation (dedicated servers) and is crucial for security, performance, and compliance.

 

The Data Isolation Spectrum in Databricks

Data isolation in Databricks is a multi-layered strategy used to separate data and AI assets to ensure security, compliance, and transaction integrity. Think of isolation as layers, not a single switch: the higher you isolate, the smaller the blast radius. Legal compliance with HIPAA can be achieved at any of these levels.

 

| Level      | Strength | Use Case         |
|------------|----------|------------------|
| Account    | ⭐⭐⭐⭐⭐    | Strongest        |
| Workspace  | ⭐⭐⭐⭐     | Recommended      |
| Network    | ⭐⭐⭐⭐     | Defense-in-depth |
| Storage    | ⭐⭐⭐⭐     | Required         |
| Compute    | ⭐⭐⭐      | Sensitive jobs   |
| UC Catalog | ⭐⭐       | Governance       |
| UC Schema  | ⭐        | Organization     |
| Table      | ⭐        | Least privilege  |

Account and Workspace would appear to be the no-brainer choices (these options only became available in Databricks in late 2020). At the time of writing, according to the Databricks resource limits documentation, "a user can't belong to more than 50 Databricks accounts" (fixed), and each account can have 50 workspaces (not fixed). The number of workspaces is tied more to the underlying cloud provider's subscription limits than to a hard limit imposed by Databricks. These limits apply to resources like vCPUs, network infrastructure, or the number of virtual machines that can be launched. After the limits are raised on the cloud provider side, Databricks can then raise them on their end. A quick calculation: each customer will most likely need a QA, DEV, and PROD environment, so for 100 customers we would need 300 workspaces. An astute engineer would quickly realize that they want an infrastructure-as-code tool like Terraform and should confirm up front that they can provision 300 workspaces. Here is a Databricks blog regarding best practices in platform architecture that is worth reading.

āœ… Safest Interpretation
One Databricks account, one workspace per customer
Dedicated cloud resources (VNet/VPC)
Unity Catalog scoped to that workspace only
No shared metastores or storage locations
This is the cleanest HIPAA story for auditors
Following this alone, however, does not necessarily equate to HIPAA Safe Harbor.

While multiple workspaces are a recommended pattern for strong isolation (especially dev/staging/prod or different business units), they introduce real operational trade-offs compared to a single (or very few) workspace approach — especially when using Unity Catalog for centralized governance.

| # | Disadvantage | Explanation / Impact | Severity (typical) |
|---|--------------|----------------------|--------------------|
| 1 | Increased administrative & operational overhead | Every workspace needs separate setup, maintenance, user/group management, cluster policies, IP ACLs, PrivateLink config (if used), etc. Manual management becomes painful quickly. | High |
| 2 | Harder to automate & higher IaC complexity | You must use Infrastructure-as-Code (Terraform / Databricks Asset Bundles / APIs) to keep environments consistent, otherwise chaos ensues. | High |
| 3 | Worse collaboration & knowledge sharing | Notebooks, dashboards, experiments, and jobs cannot be easily shared or collaborated on across workspaces. Cross-workspace lineage visibility and discovery are more limited. | Medium–High |
| 4 | Higher cost (direct + indirect) | More workspaces → more small/medium clusters → potentially higher DBU consumption due to less pooling. Also indirect costs from more admin time and setup. | Medium |
| 5 | Workspace resource limits become relevant | Each workspace has its own per-workspace limits (concurrent clusters, jobs, users, API rate limits, etc.). Heavy usage in one workspace won't affect others, but you hit limits faster per workspace. | Medium |
| 6 | Account-level workspace limit | Default limit of ~50 workspaces per account (enterprise tier). Large orgs with dozens of teams may need multiple Databricks accounts, adding even more complexity. | Medium–High (large orgs) |
| 7 | Switching friction for users | Users must constantly switch workspaces in the UI, which is annoying for people who need to work across environments/teams. | Medium |
| 8 | More complex monitoring & observability | Harder to get a unified view of jobs/clusters/costs/performance across many workspaces (requires account-level tools or extra setup). | Medium |
| 9 | Slower onboarding & higher learning curve for new users | New team members need to understand which workspace does what, how permissions differ, etc. | Low–Medium |
| 10 | Some features are harder or more limited | Certain things (e.g. some legacy feature store patterns, cross-workspace notebook collaboration, shared SQL warehouses) become more complicated or impossible. | Low–Medium |
| 11 | Unity Catalog, not workspaces, protects the data | Unity Catalog is the real protector of the data, not workspaces. Data can gain visibility where it should not through multiple holes. | Medium |

The bottom line as of 2025 is that the sweet spot for most organizations is 3–12 workspaces (typically dev / staging / prod, plus maybe 1–2 sandboxes or department-specific ones).

Going beyond ~20–30 usually signals the need to invest heavily in automation and to hire a good full-time DevOps engineer, or two, to own it. Question whether every new request really needs its own workspace; many times catalog-level isolation + cluster policies + Unity Catalog bindings is enough.

If you're hitting scaling pain with multiple workspaces today, consider consolidating where possible and leaning harder on Unity Catalog + workspace-catalog bindings + Schema Governance + Encryption UDFs for isolation instead of proliferating more workspaces.
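For example, catalog-level isolation can be expressed entirely in Unity Catalog grants (the customer_a catalog, clinical schema, and customer_a_analysts group below are hypothetical names used only for illustration):

# Run from a notebook in a Unity Catalog-enabled workspace; names are illustrative
spark.sql("GRANT USE CATALOG ON CATALOG customer_a TO `customer_a_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA customer_a.clinical TO `customer_a_analysts`")
spark.sql("GRANT SELECT ON TABLE customer_a.clinical.patients TO `customer_a_analysts`")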

 

Data Isolation by BYOK Encryption UDFs

For all practical purposes, data isolation is achieved when it cannot be read by an unauthorized entity.

Read more here about why HIPAA Safe Harbor is the gold standard: if you strictly follow the 18-identifier removal list, HIPAA treats the data as legally de-identified, so the entity cannot be penalized for HIPAA violations for using or sharing that data. HIPAA's encryption safe harbor is a provision under the HITECH Act that exempts organizations from mandatory breach notification if stolen or lost electronic Protected Health Information (ePHI) is rendered unusable, unreadable, or indecipherable through strong encryption, often aligned with NIST standards (such as AES); no breach is deemed to have occurred, provided the encryption keys are not compromised. To qualify, all ePHI, both at rest (stored) and in transit (moving), must be encrypted using robust methods, often requiring strong passcodes and key management, so that data remains secure even if devices are lost or accessed without authorization. Note that there are other important details not covered here.

First, we need a mechanism to encrypt and decrypt data in a DataFrame; here is a UDF that gets us there. Next, every customer has a service principal identity with one and only one key registered in the secret store. With this in place we have encryption-level data isolation: even if one customer received another customer's data by accident, they still could not read it, since they lack the correct key to decrypt it.

from cryptography.fernet import Fernet
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Load the key once - do this in your notebook or job initialization.
# current_user() resolves the identity running the notebook, so the key name is per-identity.
current_user = spark.sql("SELECT current_user()").first()[0]
FERNET_KEY = dbutils.secrets.get(scope="crypto", key=f"{current_user}-fernet-main-key-2025")
fernet_singleton = Fernet(FERNET_KEY.encode())

def encrypt_fernet(plain_text):
    if plain_text is None:
        return None
    return fernet_singleton.encrypt(plain_text.encode()).decode()

def decrypt_fernet(encrypted_text):
    if encrypted_text is None:
        return None
    return fernet_singleton.decrypt(encrypted_text.encode()).decode()

# Wrap the plain Python functions as Spark UDFs so they can operate on columns
encrypt_fernet_udf = udf(encrypt_fernet, StringType())
decrypt_fernet_udf = udf(decrypt_fernet, StringType())

def decrypt_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    return df.transform(
        lambda d: d.select(
            *[
                decrypt_fernet_udf(col(c)).alias(c) if c in columns else col(c)
                for c in d.columns
            ]
        )
    )

def encrypt_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    return df.transform(
        lambda d: d.select(
            *[
                encrypt_fernet_udf(col(c)).alias(c) if c in columns else col(c)
                for c in d.columns
            ]
        )
    )
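For a quick illustration of how these helpers compose (the patients_df below is a hypothetical in-memory DataFrame, not part of the schema above):

# Hypothetical example data; in practice this comes from a raw table
patients_df = spark.createDataFrame(
    [("0000000001", "Ada", "123-45-6789")],
    ["ID", "FirstName", "SSN"],
)

# Encrypt the PII columns before persisting or sharing anything
protected_df = encrypt_columns(patients_df, ["FirstName", "SSN"])

# Decrypt only when a downstream transformation genuinely needs the clear text
clear_df = decrypt_columns(protected_df, ["FirstName", "SSN"])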

 

Schema Governance in PySpark

As mentioned above, PySpark does not have Scala's Dataset class, so here is a minimal emulation of it in PySpark. Because Scala is a statically typed language, it throws an exception at compile time; the best PySpark can do is raise one at runtime, unfortunately.

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType

def Dataset(df: DataFrame, schema: StructType, strict: bool = True) -> DataFrame:
    # Emulates Scala's Dataset[T]: validate df against the schema, then project to it
    if strict and len(schema.fieldNames()) != len(df.columns):
        raise Exception("DataFrame and schema do not have the same number of columns")

    for field in schema.fields:
        col_name = field.name
        col_type = field.dataType
        col_null = field.nullable

        # check for existence
        if col_name not in df.columns:
            raise Exception(f"column missing: {col_name}")

        if not strict:
            continue

        if col_type != df.schema[col_name].dataType:
            raise Exception(
                f"column type is incorrect for '{col_name}', "
                f"expected '{col_type}', got '{df.schema[col_name].dataType}'"
            )

        if not col_null:
            null_count = df.filter(df[col_name].isNull()).count()
            if null_count > 0:
                raise Exception(f"null constraint violation on column '{col_name}'")

    # reorder columns to match the schema
    return df.select(*schema.fieldNames())

Dataset(my_dataframe, person_schema)  # person_schema is the StructType defined above; throws an exception if the contract is violated

It is worth noting that neither StructType.fromJson nor spark.read.schema(...).table("...") properly loads the metadata from the StructType given to it; this has been a known issue in Spark for a long time. Below is code to work around it.

from pyspark.sql.types import StructField, StructType

def fix_schema(schema: StructType, schema_json: dict) -> StructType:
    # Re-attach the metadata that StructType.fromJson drops from each field
    fields = []
    for field_data in schema:
        field_json = [f for f in schema_json["fields"] if f["name"] == field_data.name][0]
        metadata = field_json.get("metadata", {})
        new_field = StructField(
            name=field_data.name,
            dataType=field_data.dataType,
            nullable=field_data.nullable,
            metadata=metadata,
        )
        fields.append(new_field)
    return StructType(fields)

s_json = …  # our patient JSON from above, parsed into a dict
struct = StructType.fromJson(s_json)
struct = fix_schema(struct, s_json)
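A quick sanity check (assuming the patient JSON from above is bound to s_json) shows the metadata is back on each field:

for field in struct.fields:
    print(field.name, field.metadata)
# ID {'PK': True, 'comment': 'PK', 'validation': {'regex_match': '\\d{10}'}}
# FirstName {'PII': True}
# SSN {'PII': True}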

Now we need a simple helper method to validate our custom data quality rules; brilliant engineers will realize that they can extend PySpark's DataFrame via monkey-patching, but this will work for our demonstration.

from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StructType

def validate_dataframe(df: DataFrame, schema: StructType = None) -> None:
    if isinstance(schema, StructType):
        pass
    elif schema is None:
        schema = df.schema
    else:
        raise ValueError("parameter schema is of an invalid value")

    for field in schema.fields:
        col_name = field.name
        col_metadata = field.metadata
        # TODO: use a factory design pattern over if-elses
        if "validation" in col_metadata:
            validation = col_metadata["validation"]
            # do a regex check against every value as a super simple check
            if "regex_match" in validation:
                pattern = validation["regex_match"]
                failed = df.filter(~F.col(col_name).rlike(pattern)).count()
                if failed > 0:
                    raise Exception(f"failed on regex for column '{col_name}'")
        if "PII" in col_metadata:
            pass  # TODO: confirm that the field is encrypted
        if "PK" in col_metadata:
            pass  # TODO: confirm that the column holds a unique value per row
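With the repaired struct from above, running the checks against a DataFrame is a one-liner:

# Raises if, for example, an ID value does not match the 10-digit regex in the metadata
validate_dataframe(my_dataframe, struct)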

Now for the goal of this article: "not actively being transformed". Again, we only want data to be decrypted when, and only when, it is actively being worked on. Note that there will be a performance hit because the encryption algorithms run on a per-row basis. Let's get that implemented.

from typing import Callable
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def business_logic(df: DataFrame) -> DataFrame:
    return (
        df
        .filter(col("ssn").isNotNull())
        .select("id", "ssn")
    )

def with_fernet(
    df: DataFrame,
    columns: list[str],
    fn: Callable[[DataFrame], DataFrame],
) -> DataFrame:
    # Use the explicitly requested columns, or fall back to any column whose
    # metadata flags it as PII, PHI, or encrypted
    encrypted_columns = columns or [
        f.name
        for f in df.schema.fields
        if f.metadata.get("PII") or f.metadata.get("PHI") or f.metadata.get("encrypted")
    ]
    return (
        df
        .transform(lambda d: decrypt_columns(d, encrypted_columns))
        .transform(fn)
        .transform(lambda d: encrypt_columns(d, encrypted_columns))
    )

final_df = with_fernet(df, ["ssn"], business_logic)

 
