Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Describe table extended on materialized views - UC, DLT and cluster access modes

Nurota
New Contributor
We have a daily job with a notebook that loops through all the databases and tables and runs OPTIMIZE and VACUUM on them.
Since DLT tables in Unity Catalog (UC) are materialized views, the OPTIMIZE and VACUUM commands do not work on them, and they need to be excluded. So I added a check of the table type that skips views and materialized views. But the job throws an error before it ever reaches that check, on the DESCRIBE TABLE EXTENDED command. Apparently you cannot run DESCRIBE TABLE EXTENDED on materialized views (DLT tables), and the reason seems tied to the ownership of the DLT pipeline that creates the materialized views, the cluster access mode, and the principal who runs the code.
It seems the principal running the job or notebook needs to own the DLT pipeline in order to run DESCRIBE TABLE EXTENDED on its materialized views, and even then only on a shared access mode cluster or a personal cluster.
 
I will list the scenarios I tested out:
  1. Originally, a scheduled job ran the workflow on a job cluster as the prod service principal. It throws "java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Your token is missing the required scopes for this endpoint." on the DESCRIBE TABLE EXTENDED command. In our workspace, all DLT pipelines and jobs have the prod service principal as owner and "run as" at this point.
  2. Ran the code in an interactive notebook in my local Databricks workspace on a shared cluster. It throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." on DESCRIBE TABLE EXTENDED. When I changed the pipeline ownership to myself, it succeeded.
  3. Changed the scheduled job's cluster from a job cluster to a shared cluster. It throws "org.apache.spark.SparkException: A pipeline with a different id is already registered in this Credential Scope. The pipeline in a CredentialScope cannot be updated once it is registered." on DESCRIBE TABLE EXTENDED. All DLT pipelines and jobs still have the prod service principal as owner and "run as" at this point.
  4. Running the code on a personal cluster throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." It works when the DLT pipeline's ownership is changed to me, but scheduled jobs cannot run on personal clusters.

We want all the jobs and DLT pipelines to run under the service principal, and all the jobs to use a job cluster (job clusters are single user access mode clusters).

I reviewed the following limitations:

UC cluster mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html

Materialized view limitations: https://docs.databricks.com/en/sql/user/materialized-views.html#limitations

Questions:

1. Given that materialized views can only be queried on a shared access mode cluster, why does scenario 3 throw that error? Shouldn't it succeed, since the service principal runs the scheduled job on a shared cluster and owns the DLT pipeline that creates the materialized views?

2. What do each of these error messages mean, especially the error in scenario 3? 

3. Is there a solution to this problem, besides hard-coding the schemas to exclude from the loop (as with 'information_schema')? Note that the schemas containing materialized views also contain regular Delta tables, and those Delta tables need to stay in the "db_tables" list so they get optimized and vacuumed.

4. Is there a way to exclude materialized views from my loop without DESCRIBE TABLE EXTENDED (i.e., any other way to check a table's type)?

Here is the code:

 

db_tables = []
for database in spark.sql("SHOW DATABASES").collect():
  # Skip the 'information_schema' database
  if database['databaseName'].lower() != 'information_schema':
    for table in spark.sql(f"SHOW TABLES IN {database['databaseName']}").collect():
      # Skip temporary tables
      if not table['isTemporary']:
        # Use DESCRIBE TABLE EXTENDED to check whether it is a view or a table
        desc_table = spark.sql(f"DESCRIBE TABLE EXTENDED {database['databaseName']}.{table['tableName']}").collect()
        if any(row['col_name'] == 'Type' and row['data_type'] in ('VIEW', 'MATERIALIZED_VIEW') for row in desc_table):
          continue  # Skip views and materialized views
        else:
          db_tables.append([database['databaseName'], table['tableName']])
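For reference, a defensive variant of the loop above (an editor's sketch, not from the original post): since DESCRIBE TABLE EXTENDED is exactly the statement that fails on materialized views the principal does not own, the failure itself can be treated as the signal to skip the object. The loop is wrapped in a function taking the `spark` session as a parameter, with a pure helper for the type check; both names are illustrative.

```python
def is_view_like(desc_rows):
    """True if DESCRIBE TABLE EXTENDED rows mark the object as a view
    or materialized view. Each row exposes 'col_name' and 'data_type'."""
    return any(
        row['col_name'] == 'Type' and row['data_type'] in ('VIEW', 'MATERIALIZED_VIEW')
        for row in desc_rows
    )

def collect_optimizable_tables(spark):
    """Walk all databases and return [db, table] pairs that are safe to
    OPTIMIZE/VACUUM. Tables whose DESCRIBE fails (e.g. a materialized
    view owned by another principal) are skipped instead of crashing."""
    db_tables = []
    for database in spark.sql("SHOW DATABASES").collect():
        db = database['databaseName']
        if db.lower() == 'information_schema':
            continue
        for table in spark.sql(f"SHOW TABLES IN {db}").collect():
            if table['isTemporary']:
                continue
            try:
                desc = spark.sql(f"DESCRIBE TABLE EXTENDED {db}.{table['tableName']}").collect()
            except Exception:
                continue  # DESCRIBE itself failed -- skip the object
            if not is_view_like(desc):
                db_tables.append([db, table['tableName']])
    return db_tables
```

The broad `except Exception` is deliberate for a maintenance sweep: any object that cannot even be described is not something this job should try to OPTIMIZE or VACUUM.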

 

1 REPLY

Kaniz
Community Manager

Hi @Nurota, let's look at why scenario 3 throws an error despite the shared access mode cluster and the service principal's ownership of the pipeline.

  1. Cluster Type and Materialized Views:

    • In Databricks, the type of cluster plays a crucial role in how it interacts with materialized views (DLT tables).
    • Materialized views can only be queried on shared access mode clusters or personal clusters.
    • When you switched the job's cluster from a job cluster to a shared cluster, you moved to a cluster type that does support querying materialized views.
  2. Pipeline Ownership and Pipeline ID:

    • The error message you encountered, "A pipeline with a different id is already registered in this Credential Scope," relates to how credentials are scoped per pipeline.
    • Each materialized view is backed by the specific DLT pipeline that created it, identified by a pipeline ID.
    • Once a cluster session's credential scope has been registered against one pipeline, it apparently cannot be re-registered for a different one; touching materialized views from multiple pipelines in the same session can then trigger this error.
  3. Principal Permissions and Scopes:

    • While the service principal is indeed the owner of the DLT pipeline, the token used by the principal must have the necessary permissions (scopes) to execute certain commands.
    • The “describe table extended” command requires specific scopes. If the token lacks these scopes, you’ll encounter the “403: Your token is missing the required scopes for this endpoint” error.
  4. Solutions and Considerations:

    • To address this:
      • Pipeline ID Consistency: Ensure consistent pipeline IDs when switching clusters. Avoid changing the cluster type after creating materialized views.
      • Verify Token Scopes: Double-check that the token used by the principal has the required scopes.
      • Alternative Approaches: Explore other ways to retrieve metadata about materialized views without relying solely on “describe table extended.”
      • Dynamic Exclusion: Instead of hard-coding schema exclusions, consider dynamically identifying materialized views based on naming conventions or other attributes.
  5. Conclusion:

    • Databricks configurations can be intricate, and achieving the right balance between ownership, access modes, and permissions is essential.
    • Adapt these solutions to your specific use case, and don’t hesitate to seek further assistance if needed. You’re navigating a complex landscape, and I’m here to help! 🚀
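The "alternative approaches" and "dynamic exclusion" points above can be made concrete. Unity Catalog exposes an `information_schema.tables` view per catalog with a `table_type` column, so view-like objects can be filtered out in a single query instead of one DESCRIBE per table, and no pipeline ownership is needed. This is an editor's sketch under that assumption; the catalog name and function names are illustrative.

```python
def exclusion_query(catalog):
    """SQL listing objects in `catalog` that are not views or materialized
    views, skipping the information_schema schema itself. Assumes the
    information_schema.tables view exposes a 'table_type' column."""
    return (
        f"SELECT table_schema, table_name FROM {catalog}.information_schema.tables "
        "WHERE table_type NOT IN ('VIEW', 'MATERIALIZED_VIEW') "
        "AND table_schema <> 'information_schema'"
    )

def optimizable_tables(spark, catalog):
    """Return [schema, table] pairs eligible for OPTIMIZE/VACUUM."""
    rows = spark.sql(exclusion_query(catalog)).collect()
    return [[row['table_schema'], row['table_name']] for row in rows]
```

If streaming tables should also be excluded from VACUUM, their `table_type` value can be added to the NOT IN list; checking the distinct `table_type` values in your own catalog first is the safest way to confirm the exact strings.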