See the comments below for a runnable notebook
Throughout my career I have worked at several companies that handle sensitive data, including PII, PHI, EMR, HIPAA-regulated records, and Class I/II/III FOMC - Internal (FR) material. One entity I worked at even required a Department of Justice background check. Engineers handling sensitive data must consider how it may be exposed and develop safeguards to minimize risk. Protecting PII (personally identifiable information) matters because the number of data breaches, and of records with sensitive information exposed, is trending upward every day; do you really want your private medical details exposed to the internet, or to be partially liable for that exposure because of bad coding and practices? HIPAA requires that data "at rest" and "in motion" be encrypted; in the author's opinion, these requirements are the bare minimum and often insufficient. For a typical Databricks pipeline, that often means the cloud storage bucket has default encryption turned on and that's about it.
It is the author's opinion that truly ethically compliant pipelines have one more requirement: data should be encrypted when "at rest", "in motion", and "not actively being transformed". In other words, if a Databricks notebook (hereinafter "pipeline") loads patients' data and performs its necessary transformations, but none of those transformations touch a patient's first or last name, then at no point in that pipeline's lifetime should either of those fields be decrypted.
Data governance is a framework of policies, processes, roles, and standards that ensures an organization's data is secure, accurate, available, and usable throughout its entire lifecycle, guiding how data is collected, stored, accessed, and disposed of to meet business goals and comply with regulations. It acts as a central system, defining who can do what with data, ensuring data quality, preventing inconsistencies, and building trust in data assets for better decision-making.
The same idea applies to Spark schemas. Scala's Spark implementation has a class called Dataset[T]; it provides dataset schema governance through a combination of case classes and Scala's strong static typing. With Dataset[T] in Scala, it is easy to enforce strict or non-strict compliance as interfaces on methods.
Example Spark StructType defined in JSON
{
  "fields": [
    {
      "name": "ID",
      "nullable": false,
      "type": "string",
      "metadata": {"PK": true, "comment": "PK", "validation": {"regex_match": "\\d{10}"}}
    },
    {
      "name": "FirstName",
      "nullable": false,
      "type": "string",
      "metadata": {"PII": true}
    },
    {
      "name": "SSN",
      "nullable": false,
      "type": "string",
      "metadata": {"PII": true}
    }
  ],
  "type": "struct"
}
Example of Dataset[T] using case classes generated from the StructType above
def limitDatasetToPersonName(peopleDS: Dataset[Person]): Dataset[PersonPHI] = {
  ???
}

def transformData(sourceDS: Dataset[SourceData]): Dataset[TargetData] = {
  ???
}

Unfortunately, neither Python nor PySpark has these superpowers; implementing compliance in PySpark by hand will be necessary later.
Data isolation is the practice of separating data sets to prevent unauthorized access, limit the impact of breaches, and maintain system stability, ensuring one user's or application's data remains invisible and inaccessible to others, even on shared infrastructure like the cloud. It's achieved through logical controls (access rules, authentication) or physical separation (dedicated servers) and is crucial for security, performance, and compliance.
Data isolation in Databricks is a multi-layered strategy used to separate data and AI assets to ensure security, compliance, and transaction integrity. Think of isolation as layers, not a single switch; the higher you isolate, the smaller the blast radius. HIPAA compliance can be achieved at any of these levels.
Level | Strength | Use Case |
Account | ★★★★★ | Strongest |
Workspace | ★★★★ | Recommended |
Network | ★★★★ | Defense-in-depth |
Storage | ★★★★ | Required |
Compute | ★★★ | Sensitive jobs |
UC Catalog | ★★ | Governance |
UC Schema | ★ | Organization |
Table | ★ | Least privilege |
Account and workspace isolation would appear to be the no-brainer choices (these options only became available in Databricks in late 2020). At the time of writing, according to the Databricks resource limits documentation, "a user can't belong to more than 50 Databricks accounts" (fixed), and each account can have 50 workspaces (not fixed). The number of workspaces is tied more to the underlying cloud provider's subscription limits than to a hard limit imposed by Databricks. These limits are on resources like vCPUs, network infrastructure, or the number of virtual machines that can be launched. After increasing the limits with the cloud provider, Databricks can then increase those limits on their end. A quick calculation: each customer will most likely need QA, DEV, and PROD environments, so for 100 customers we would need 300 workspaces. An astute engineer would quickly realize that they want an infrastructure-as-code tool like Terraform and should verify up front that they can actually provision 300 workspaces. Here is a Databricks blog regarding best practices in platform architecture that is worth reading.
Safest Interpretation
One Databricks account, one workspace per customer
Dedicated cloud resources (VNet/VPC)
Unity Catalog scoped to that workspace only
No shared metastores or storage locations
This is the cleanest HIPAA story for auditors
Following this alone does not necessarily equate to HIPAA Safe Harbor.
While multiple workspaces are a recommended pattern for strong isolation (especially dev/staging/prod or different business units), they introduce real operational trade-offs compared to a single (or very few) workspace approach, especially when using Unity Catalog for centralized governance.
# | Disadvantage | Explanation / Impact | Severity (typical) |
1 | Increased administrative & operational overhead | Every workspace needs separate setup, maintenance, user/group management, cluster policies, IP ACLs, PrivateLink config (if used), etc. Manual management becomes painful quickly. | High |
2 | Harder to automate & higher IaC complexity | You must use Infrastructure-as-Code (Terraform / Databricks Asset Bundles / APIs) to keep environments consistent, otherwise chaos ensues. | High |
3 | Worse collaboration & knowledge sharing | Notebooks, dashboards, experiments, and jobs cannot be easily shared or collaborated on across workspaces. Cross-workspace lineage visibility and discovery are more limited. | Medium–High |
4 | Higher cost (direct + indirect) | More workspaces → more small/medium clusters → potentially higher DBU consumption due to less pooling. Also indirect costs from more admin time and setup. | Medium |
5 | Workspace resource limits become relevant | Each workspace has its own per-workspace limits (concurrent clusters, jobs, users, API rate limits, etc.). Heavy usage in one workspace won't affect others, but you hit limits faster per workspace. | Medium |
6 | Account-level workspace limit | Default hard/soft limit of ~50 workspaces per account (enterprise tier). Large orgs with dozens of teams may need multiple Databricks accounts, adding even more complexity. | Medium–High (large orgs) |
7 | Switching friction for users | Users must constantly switch workspaces in the UI, which is annoying for people who need to work across environments/teams. | Medium |
8 | More complex monitoring & observability | Harder to get a unified view of jobs/clusters/costs/performance across many workspaces (requires account-level tools or extra setup). | Medium |
9 | Slower onboarding & higher learning curve for new users | New team members need to understand which workspace does what, how permissions differ, etc. | Low–Medium |
10 | Some features are harder/more limited | Certain things (e.g. some legacy feature store patterns, cross-workspace notebook collaboration, shared SQL warehouses) become more complicated or impossible. | Low–Medium |
11 | Unity Catalog, not workspaces, protects the data | Unity Catalog is the real protector of the data, not workspaces. Data can gain visibility where it should not through multiple gaps. | Medium |
The bottom line as of 2025 is that the sweet spot for most organizations is 3–12 workspaces (typically dev / staging / prod, plus maybe 1–2 sandboxes or department-specific ones).
Going beyond ~20–30 usually signals the need to invest heavily in automation and a good full-time DevOps engineer, or two, to own it; question whether every new request really needs its own workspace, since catalog-level isolation + cluster policies + Unity Catalog bindings are often enough.
If you're hitting scaling pain with multiple workspaces today, consider consolidating where possible and leaning harder on Unity Catalog + workspace-catalog bindings + Schema Governance + Encryption UDFs for isolation instead of proliferating more workspaces.
For all practical purposes, data isolation is achieved when it cannot be read by an unauthorized entity.
Read more here about why HIPAA Safe Harbor is the gold standard; if you strictly follow the 18-identifier removal list, HIPAA guarantees that the data is legally de-identified, so the entity cannot be penalized for HIPAA violations for using or sharing that data. HIPAA's encryption safe harbor is a provision under the HITECH Act that exempts organizations from mandatory breach notification if stolen or lost electronic Protected Health Information (ePHI) is rendered unusable, unreadable, or indecipherable through strong encryption, often aligned with NIST standards (like AES); no breach is deemed to have occurred, provided the encryption keys are not compromised. To qualify, all ePHI, both at rest (stored) and in transit (moving), must be encrypted using robust methods, often requiring strong passcodes and key management, so the data remains secure even if devices are lost or accessed without authorization. There are other important details that are not covered here.
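Because so much of what follows is driven by flagging columns as PII/PHI, it can help to keep the Safe Harbor list itself in code. Below is a minimal sketch; the constant name is an assumption, and how you map these categories to your own column names is up to you.

from typing import List

# The 18 HIPAA Safe Harbor identifier categories (45 CFR 164.514(b)(2)),
# kept as one constant so PII-flagging decisions reference a single source of truth.
SAFE_HARBOR_IDENTIFIERS: List[str] = [
    "names",
    "geographic subdivisions smaller than a state",
    "dates (except year) related to an individual; ages over 89",
    "telephone numbers",
    "fax numbers",
    "email addresses",
    "social security numbers",
    "medical record numbers",
    "health plan beneficiary numbers",
    "account numbers",
    "certificate/license numbers",
    "vehicle identifiers and serial numbers, including license plates",
    "device identifiers and serial numbers",
    "web URLs",
    "IP addresses",
    "biometric identifiers, including finger and voice prints",
    "full-face photographs and comparable images",
    "any other unique identifying number, characteristic, or code",
]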
First, we need a mechanism to encrypt and decrypt data in a DataFrame; here are UDFs that get us there. Next, every customer has a service principal identity with one and only one key registered in the secret store. Under this arrangement we have encryption-level data isolation: even if one customer got another customer's data by accident, they still could not read it, since they lack the correct key to decrypt it.
from cryptography.fernet import Fernet
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Load once - do this in your notebook or job initialization
# (current_user is assumed to be resolved earlier in the notebook)
FERNET_KEY = dbutils.secrets.get(scope="crypto", key=f"{current_user}-fernet-main-key-2025")
cipher_suite = Fernet(FERNET_KEY.encode())

@udf(returnType=StringType())
def encrypt_fernet(plain_text):
    if plain_text is None:
        return None
    return cipher_suite.encrypt(plain_text.encode()).decode()

@udf(returnType=StringType())
def decrypt_fernet(encrypted_text):
    if encrypted_text is None:
        return None
    return cipher_suite.decrypt(encrypted_text.encode()).decode()

def decrypt_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    # Decrypt only the listed columns; everything else passes through untouched
    return df.transform(
        lambda d: d.select(
            *[
                decrypt_fernet(col(c)).alias(c) if c in columns else col(c)
                for c in d.columns
            ]
        )
    )

def encrypt_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    # Encrypt only the listed columns; everything else passes through untouched
    return df.transform(
        lambda d: d.select(
            *[
                encrypt_fernet(col(c)).alias(c) if c in columns else col(c)
                for c in d.columns
            ]
        )
    )
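To make the usage concrete, here is a quick, hedged example against a throwaway DataFrame; the column names are purely illustrative.

# Illustrative only: a tiny DataFrame with one PII column
people_df = spark.createDataFrame(
    [("0000000001", "Alice"), ("0000000002", "Bob")],
    ["ID", "FirstName"],
)

# Encrypt the PII column before it is persisted or handed to anyone else
encrypted_df = encrypt_columns(people_df, ["FirstName"])
encrypted_df.show(truncate=False)  # FirstName now holds Fernet ciphertext

# Decrypt only when the column is actively needed
decrypted_df = decrypt_columns(encrypted_df, ["FirstName"])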
As mentioned above, PySpark does not have Scala's Dataset class, so here is a minimal emulation of it in PySpark. Scala, being a strongly and statically typed language, fails at compile time; the best PySpark can do is fail at runtime, unfortunately.
from pyspark.sql.types import StructType

def Dataset(df: DataFrame, schema: StructType, strict: bool = True) -> DataFrame:
    # Emulates Scala's Dataset[T] by validating the DataFrame against the schema at runtime
    if strict and len(schema.fieldNames()) != len(df.columns):
        raise Exception("not equal number of columns between the two")
    for field in schema.fields:
        col_name = field.name
        col_type = field.dataType
        col_null = field.nullable
        # check for existence
        if col_name not in df.columns:
            if not strict:
                continue
            raise Exception(f"column missing: {col_name}")
        # check the data type
        if col_type != df.schema[col_name].dataType:
            raise Exception(
                f"column type is incorrect for '{col_name}', expected '{col_type}', got '{df.schema[col_name].dataType}'."
            )
        # check the nullability constraint
        if not col_null:
            null_count = df.filter(df[col_name].isNull()).count()
            if null_count > 0:
                raise Exception(f"null constraint violation on column '{col_name}'")
    # reorder to match schema
    return df.select(*schema.fieldNames())
Dataset(my_dataframe, person_schema)  # will throw an exception if the schema is violated

It is worth noting that neither StructType.fromJson nor spark.read.schema(xxxx).table("xxx") properly loads the metadata from the StructType given to it; this has been a known bug in Spark for a long time. Below is code to work around it.
from pyspark.sql.types import StructField, StructType

def fix_schema(schema: StructType, s_json: dict) -> StructType:
    # Rebuild the metadata field after StructType.fromJson,
    # which does not reliably carry over the metadata portion
    fields = []
    for field_data in schema:
        matches = [k for k in s_json["fields"] if k["name"] == field_data.name]
        metadata = matches[0].get("metadata", {}) if matches else {}
        new_field = StructField(
            name=field_data.name,
            dataType=field_data.dataType,
            nullable=field_data.nullable,
            metadata=metadata
        )
        fields.append(new_field)
    return StructType(fields)

s_json = ...  # our patient JSON from above
struct = StructType.fromJson(s_json)
struct = fix_schema(struct, s_json)

Now we need a simple helper method to validate our custom data quality rules; brilliant engineers will realize that they could extend PySpark's DataFrame by monkey-patching it, but a plain helper will work for our demonstration.
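As an aside, here is a hedged sketch of that monkey-patching idea; it simply attaches the validate_dataframe helper defined next as a method on DataFrame, and every name in it is illustrative.

from pyspark.sql import DataFrame

# Hypothetical sketch: expose validation as df.validate(...) so it can be chained
def _validate(self, schema=None):
    validate_dataframe(self, schema)   # helper defined below
    return self                        # return self so it chains inside a pipeline

DataFrame.validate = _validate

# usage (once validate_dataframe exists): df.validate(struct).transform(business_logic)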
from pyspark.sql import functions as F

def validate_dataframe(df: DataFrame, schema: StructType = None) -> None:
    if isinstance(schema, StructType):
        pass
    elif schema is None:
        schema = df.schema
    else:
        raise ValueError("parameter schema is of an invalid value")
    for field in schema.fields:
        col_name = field.name
        col_type = field.dataType
        col_null = field.nullable
        col_metadata = field.metadata
        # TODO: use a factory design pattern instead of if-elses
        if "validation" in col_metadata:
            validation = col_metadata["validation"]
            # do a regex check as a super simple validation: every value must match
            if "regex_match" in validation:
                pattern = validation["regex_match"]
                bad_rows = df.filter(~F.col(col_name).rlike(pattern)).count()
                if bad_rows > 0:
                    raise Exception(f"failed regex validation on column '{col_name}'")
        if "PII" in col_metadata:
            pass  # TODO: confirm that the field is encrypted
        if "PK" in col_metadata:
            pass  # TODO: confirm that the column values are unique

Now for the goal of this article: "not actively being transformed". Again, we only want data to be decrypted when, and only when, it is actively being worked on. Note that there will be a performance hit because the encryption algorithms run on a per-row basis. Let's get that implemented.
def business_logic(df):
return (
df
.filter(col("ssn").isNotNull())
.select("id", "ssn")
)
from typing import Callable

def with_fernet(
    df: DataFrame,
    columns: list[str],
    fn: Callable[[DataFrame], DataFrame],
) -> DataFrame:
    # columns flagged as sensitive in the schema metadata, plus any explicitly requested
    flagged = [
        f.name for f in df.schema.fields
        if f.metadata.get("PII") or f.metadata.get("PHI") or f.metadata.get("encrypted")
    ]
    encrypted_columns = list(set(flagged) | set(columns))
    return (
        df
        .transform(lambda d: decrypt_columns(d, encrypted_columns))
        .transform(fn)
        .transform(lambda d: encrypt_columns(d, encrypted_columns))
    )

final_df = with_fernet(df, ["ssn"], business_logic)
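Putting the pieces together, here is a hedged sketch of what an end-to-end run might look like; the table names are placeholders, and everything else reuses the helpers defined above.

# Hypothetical end-to-end flow; table names are placeholders
raw_df = spark.read.table("main.clinical.patients_raw")

# 1. Load the governed schema and repair its metadata
struct = fix_schema(StructType.fromJson(s_json), s_json)

# 2. Enforce the schema contract and run the metadata-driven data quality checks
typed_df = Dataset(raw_df, struct)
validate_dataframe(typed_df, struct)

# 3. Decrypt the sensitive columns only for the duration of the business logic
result_df = with_fernet(typed_df, ["SSN"], business_logic)

# 4. Persist with the sensitive columns encrypted again
result_df.write.mode("overwrite").saveAsTable("main.clinical.patients_curated")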