<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How We Built Robust Data Governance at Scale in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/149857#M1050</link>
    <description>&lt;P&gt;cannot seem to find Databricks Classification API?&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 05 Mar 2026 03:34:13 GMT</pubDate>
    <dc:creator>Garethcb</dc:creator>
    <dc:date>2026-03-05T03:34:13Z</dc:date>
    <item>
      <title>How We Built Robust Data Governance at Scale</title>
      <link>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126490#M508</link>
      <description>&lt;P&gt;In today's data-driven world, trust is currency, and that trust starts with &lt;STRONG&gt;quality data governed by strong principles&lt;/STRONG&gt;. For one of our clients, where we're on a mission to &lt;STRONG&gt;build intelligent enterprises with AI&lt;/STRONG&gt;, data isn't just an asset; it's a responsibility.&lt;/P&gt;&lt;P&gt;So how do you scale data governance across petabytes, hundreds of users, and global compliance expectations?&lt;/P&gt;&lt;P&gt;Let me take you behind the scenes of how we architected a &lt;STRONG&gt;secure, automated, and scalable data governance framework&lt;/STRONG&gt; using &lt;STRONG&gt;Databricks&lt;/STRONG&gt;, &lt;STRONG&gt;AWS&lt;/STRONG&gt;, and some clever engineering.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":bar_chart:"&gt;📊&lt;/span&gt; Executive View: The Size of the Problem&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We’re powering AI and data transformation across industries:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Manufacturing&lt;/LI&gt;&lt;LI&gt;Retail &amp;amp; CPG&lt;/LI&gt;&lt;LI&gt;Healthcare &amp;amp; Life Sciences&lt;/LI&gt;&lt;LI&gt;Energy &amp;amp; Sustainability&lt;/LI&gt;&lt;LI&gt;Financial Services&lt;/LI&gt;&lt;LI&gt;Education&amp;nbsp;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;But with great data comes even &lt;STRONG&gt;greater complexity&lt;/STRONG&gt;: different regulations, sensitive data like PII, and cross-functional stakeholders.&lt;/P&gt;&lt;P&gt;That’s where our &lt;STRONG&gt;Data Quality &amp;amp; Governance Initiative&lt;/STRONG&gt; began.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;🧭 Setting the Foundation: What Is Data Governance?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We defined a comprehensive &lt;STRONG&gt;data governance framework&lt;/STRONG&gt; with these core pillars:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;span class="lia-unicode-emoji" title=":locked_with_key:"&gt;🔐&lt;/span&gt; &lt;STRONG&gt;Data Classification &amp;amp; Cataloging&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span 
class="lia-unicode-emoji" title=":bust_in_silhouette:"&gt;👤&lt;/span&gt; &lt;STRONG&gt;Access Control &amp;amp; Security&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;🧼 &lt;STRONG&gt;Data Quality Management&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;🧬 &lt;STRONG&gt;Metadata Management&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;🧾 &lt;STRONG&gt;Data Lineage &amp;amp; Traceability&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Our goal? Build a &lt;STRONG&gt;self-service, secure, and compliant&lt;/STRONG&gt; environment where business teams can access what they need without compromising privacy or compliance.&lt;/P&gt;&lt;P&gt;Let’s unpack each of these pillars with real-world implementation details.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 1: &lt;/STRONG&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":magnifying_glass_tilted_left:"&gt;🔍&lt;/span&gt; PII Classification: Taming the Sensitive Beast&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Before governance comes &lt;STRONG&gt;knowing your data&lt;/STRONG&gt;. And that means &lt;STRONG&gt;identifying PII (Personally Identifiable Information)&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Automated PII Detection with Unity Catalog&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We used Databricks' &lt;STRONG&gt;classify_tables()&lt;/STRONG&gt; function to scan our entire catalog for PII entities such as:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Name&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Address&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Phone, Email&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; IPs, SSNs, Photos&lt;BR /&gt;... 
and more.&lt;/P&gt;&lt;P&gt;Each table’s results were reviewed and confirmed by &lt;STRONG&gt;data owners&lt;/STRONG&gt;—humans + machine learning = &lt;span class="lia-unicode-emoji" title=":hundred_points:"&gt;💯&lt;/span&gt; confidence.&lt;/P&gt;&lt;P&gt;We started by scanning &lt;STRONG&gt;Bronze, Silver, and Gold layers&lt;/STRONG&gt; in our Databricks lakehouse. Using Databricks’ classification APIs, we ran classify_tables() over each catalog and schema in Unity Catalog.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from databricks.data_classification import classify_tables

# classify on catalog level
results = classify_tables(securable_name="catalog")

# classify on schema level
# results = classify_tables(securable_name="catalog.schema")

# classify on table level
# results = classify_tables(securable_name="catalog.schema.table")

display(results.summary)&lt;/LI-CODE&gt;&lt;P&gt;This gave us a &lt;STRONG&gt;summary DataFrame&lt;/STRONG&gt; of detected PII entities per table, including sample values for verification. No guesswork, just facts.&lt;/P&gt;&lt;P&gt;The result includes two pandas DataFrames, new_classifications and summary, each listing all detected PII entities across the scanned tables. For a detailed breakdown of PII findings per table, access the table_results field as follows:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;display(results.table_results["catalog.schema.table"])&lt;/LI-CODE&gt;&lt;P&gt;This shows a DataFrame where each row corresponds to a column in the table, along with any detected PII entity and up to five sample values. Columns without detected PII have null values in the pii_entity and samples fields.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nidhi_Patni_0-1753460966300.png" style="width: 775px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18508i159DAEBDF3B2800F/image-dimensions/775x309?v=v2" width="775" height="309" role="button" title="Nidhi_Patni_0-1753460966300.png" alt="Nidhi_Patni_0-1753460966300.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;By default, classify_tables() runs Data Classification over all tables under the given securable.&lt;/P&gt;&lt;P&gt;To &lt;U&gt;exclude&lt;/U&gt; any tables or schemas, use the exclude_securables parameter:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;classify_tables(
    securable_name="catalog",
    exclude_securables=[
        "catalog.schema_to_skip",              # Exclude all tables under this schema
        "catalog.other_schema.table_to_skip",  # Exclude only this table
    ],
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Step 2: &lt;span class="lia-unicode-emoji" title=":inbox_tray:"&gt;📥&lt;/span&gt; Client Collaboration: The Human Layer&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We exported the classification results to Excel and &lt;STRONG&gt;shared them with data owners&lt;/STRONG&gt;. They reviewed each flagged column and marked it as PII (Yes/No). This added a &lt;STRONG&gt;critical human review loop&lt;/STRONG&gt; to our automated system.&lt;/P&gt;&lt;P&gt;Then came the smart part.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nidhi_Patni_5-1753461715636.png" style="width: 827px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18513iD76B85768573DF96/image-dimensions/827x288?v=v2" width="827" height="288" role="button" title="Nidhi_Patni_5-1753461715636.png" alt="Nidhi_Patni_5-1753461715636.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 3: &lt;span class="lia-unicode-emoji" title=":busts_in_silhouette:"&gt;👥&lt;/span&gt; Role-Based Access for PII&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;We created granular user groups like:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&amp;lt;source_name&amp;gt;_admin&lt;/LI&gt;&lt;LI&gt;&amp;lt;source_name&amp;gt;_user&lt;/LI&gt;&lt;LI&gt;&amp;lt;source_name&amp;gt;_pii&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Only approved roles could see sensitive data. Others saw masked versions. &lt;/STRONG&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;All access rules were stored in centralized metadata tables and enforced by code.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Specifically, only members of the &amp;lt;source_name&amp;gt;_admin and &amp;lt;source_name&amp;gt;_pii groups could view sensitive fields. Everyone else got a masked version. 
All this was controlled dynamically via a &lt;STRONG&gt;central masking metadata table&lt;/STRONG&gt; and &lt;STRONG&gt;Databricks UDFs&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 4: &lt;span class="lia-unicode-emoji" title=":busts_in_silhouette:"&gt;👥&lt;/span&gt; Define PII-to-User Group Mapping Table&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We created a table that stores the PII classification of each column together with the user groups allowed to see it. Sample data for this table is shown below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nidhi_Patni_2-1753460966324.png" style="width: 819px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18510i4A2767C9B540D96E/image-dimensions/819x346?v=v2" width="819" height="346" role="button" title="Nidhi_Patni_2-1753460966324.png" alt="Nidhi_Patni_2-1753460966324.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 5: &lt;span class="lia-unicode-emoji" title=":shield:"&gt;🛡️&lt;/span&gt; One UDF to Rule Them All&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We built a single dynamic masking function to apply column-level masking at query time, based on who’s running the query.&lt;/P&gt;&lt;P&gt;Together with the mapping table, it accounts for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What table is queried&lt;/LI&gt;&lt;LI&gt;What user group the person belongs to&lt;/LI&gt;&lt;LI&gt;What PII columns should be masked&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":direct_hit:"&gt;🎯&lt;/span&gt; Result? No duplication. No confusion. Full automation.&lt;/P&gt;&lt;P&gt;No need to create separate UDFs for every PII column and user group combo. This dynamic UDF does all the heavy lifting, adapting to multiple PII columns and user groups in one go. 
Clean, efficient, and scalable!&lt;/P&gt;&lt;LI-CODE lang="python"&gt;CREATE FUNCTION IF NOT EXISTS &amp;lt;catalog_name&amp;gt;.&amp;lt;schema_name&amp;gt;.&amp;lt;masking_function_name&amp;gt;(COLUMN_NAME STRING, GROUP_NAMES STRING)
                RETURN CASE WHEN EXISTS(SPLIT(GROUP_NAMES, ','), g -&amp;gt; is_member(g)) 
                            THEN COLUMN_NAME
                            ELSE '***-**-****'
                       END;&lt;/LI-CODE&gt;&lt;P&gt;This reduced overhead, simplified debugging, and &lt;STRONG&gt;centralized governance logic&lt;/STRONG&gt; in one place.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 6: &lt;span class="lia-unicode-emoji" title=":busts_in_silhouette:"&gt;👥&lt;/span&gt; Dynamic PII Column Masking Based on User Group&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The masking() helper applies the mask to a specific column of a table for the provided user groups. It takes the catalog name, schema name, table name, column name, and user group value (a comma-separated list of group names) read from the PII-to-user-group mapping table and constructs an ALTER TABLE SQL statement that attaches the custom masking function. This enables flexible, reusable masking logic across different tables, columns, and user groups. It also includes error handling to provide detailed feedback in case of failure.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def masking(CATALOG_NAME, SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, GROUP_NAME):
    try:  # attach the shared masking function to the column; report any failure to the caller
        spark.sql(f'''ALTER TABLE {CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME} ALTER COLUMN {COLUMN_NAME} SET MASK &amp;lt;catalog_name&amp;gt;.&amp;lt;schema_name&amp;gt;.&amp;lt;masking_function_name&amp;gt; USING COLUMNS ('{GROUP_NAME}')''')
        return f'successfully masked for {CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME} and {COLUMN_NAME}'
    except Exception as e:
        return f'masking failed for {CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME} and {COLUMN_NAME}: {e}'&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Step 7: &lt;span class="lia-unicode-emoji" title=":busts_in_silhouette:"&gt;👥&lt;/span&gt;&lt;/STRONG&gt; Members of the &amp;lt;source_name&amp;gt;_admin and &amp;lt;source_name&amp;gt;_pii groups can see PII data, while users who are not in these groups see only masked values.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nidhi_Patni_3-1753460966330.png" style="width: 817px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18511i990247F5BC9B5AB1/image-dimensions/817x379?v=v2" width="817" height="379" role="button" title="Nidhi_Patni_3-1753460966330.png" alt="Nidhi_Patni_3-1753460966330.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;🧾 Metadata &amp;amp; Lineage: More Than a Glossary&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We tracked everything, from column classifications to access mappings and masking status, in a unified metadata repository. It powered:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Self-service catalogs&lt;/LI&gt;&lt;LI&gt;Lineage visualizations&lt;/LI&gt;&lt;LI&gt;Compliance audits&lt;/LI&gt;&lt;LI&gt;Root cause analysis for quality issues&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;No more tribal knowledge. Everything was logged, visualized, and queryable.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":chart_increasing:"&gt;📈&lt;/span&gt; Outcomes: From Chaos to Clarity&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Here’s what we achieved:&lt;/P&gt;&lt;UL class="lia-list-style-type-disc"&gt;&lt;LI&gt;&lt;STRONG&gt;Secure Access:&amp;nbsp;&lt;/STRONG&gt;Only approved groups could see PII. 
Everyone else saw masked values.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Auditable Governance:&amp;nbsp;&lt;/STRONG&gt;From classification to masking to access, all changes were tracked and approved.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Scalable Automation:&amp;nbsp;&lt;/STRONG&gt;A single masking function plus dynamic group checks covers new tables, columns, and groups without writing new UDFs.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business Confidence:&amp;nbsp;&lt;/STRONG&gt;Teams trusted the data and could innovate without fear.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":direct_hit:"&gt;🎯&lt;/span&gt; Final Thoughts&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Governing data isn’t about saying “no”; it’s about saying “yes” &lt;STRONG&gt;safely&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;With this system, we turned governance from a bottleneck into a &lt;STRONG&gt;business enabler&lt;/STRONG&gt;. And with Databricks, Unity Catalog, and a little creativity, we proved that &lt;STRONG&gt;security, scalability, and simplicity&lt;/STRONG&gt; can coexist.&lt;/P&gt;&lt;P&gt;If your organization is struggling with PII, data sprawl, or compliance, start with &lt;STRONG&gt;classification and automation&lt;/STRONG&gt;. Build once. Scale forever.&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF6600"&gt;&lt;STRONG&gt;&lt;EM&gt;Note: Data Classification is currently in Beta on AWS and Azure Databricks. Make sure to consult the documentation for the latest implementation steps and Beta conditions.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;#Databricks #PIIDataClassification #PIIDataMasking #UnityCatalog&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jul 2025 17:00:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126490#M508</guid>
      <dc:creator>Nidhi_Patni</dc:creator>
      <dc:date>2025-07-25T17:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: How We Built Robust Data Governance at Scale</title>
      <link>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126495#M510</link>
      <description>&lt;P&gt;Great article&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175274"&gt;@Nidhi_Patni&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jul 2025 17:16:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126495#M510</guid>
      <dc:creator>sridharplv</dc:creator>
      <dc:date>2025-07-25T17:16:21Z</dc:date>
    </item>
    <item>
      <title>Re: How We Built Robust Data Governance at Scale</title>
      <link>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126602#M514</link>
      <description>&lt;P&gt;Great Article&amp;nbsp;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175274"&gt;@Nidhi_Patni&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 28 Jul 2025 03:02:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/126602#M514</guid>
      <dc:creator>Dr-Sylvester</dc:creator>
      <dc:date>2025-07-28T03:02:29Z</dc:date>
    </item>
    <item>
      <title>Re: How We Built Robust Data Governance at Scale</title>
      <link>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/149857#M1050</link>
      <description>&lt;P&gt;cannot seem to find Databricks Classification API?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 03:34:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/how-we-built-robust-data-governance-at-scale/m-p/149857#M1050</guid>
      <dc:creator>Garethcb</dc:creator>
      <dc:date>2026-03-05T03:34:13Z</dc:date>
    </item>
  </channel>
</rss>

