We have a daily job with a notebook that loops through all the databases and tables, and optimizes and vacuums them.
Since DLT tables in UC are materialized views, the OPTIMIZE and VACUUM commands do not work on them, so they need to be excluded. I added a line (highlighted in yellow) that checks the table type and skips the table if it is a view or a materialized view. However, the job throws an error before it ever gets there, on the DESCRIBE TABLE EXTENDED command. Apparently you cannot run DESCRIBE TABLE EXTENDED on materialized views (DLT tables), and the reason seems to be tied to the ownership of the DLT pipeline that creates the materialized views, the cluster access mode, and the principal who runs the code.
It seems like the principal running the job or notebook needs to be an owner of the DLT pipeline in order to run DESCRIBE TABLE EXTENDED on materialized views, and even then only on a shared access mode cluster or a personal cluster.
I will list the scenarios I tested out:
- Originally, the scheduled job was set up to run the workflow on a job cluster as the prod service principal. It throws "java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Your token is missing the required scopes for this endpoint." on the DESCRIBE TABLE EXTENDED command. At this point, all the DLT pipelines and jobs in our workspace have the prod service principal as owner and "run as".
- Ran the code in an interactive notebook in my own Databricks workspace on a shared cluster. It throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." on the DESCRIBE TABLE EXTENDED command. When I changed the pipeline ownership to myself, it succeeded.
- Changed the scheduled job's cluster from a job cluster to a shared cluster. It throws "org.apache.spark.SparkException: A pipeline with a different id is already registered in this Credential Scope. The pipeline in a CredentialScope cannot be updated once it is registered." on the DESCRIBE TABLE EXTENDED command. All the DLT pipelines and jobs still have the prod service principal as owner and "run as" at this point.
- Running the code on a personal cluster throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." It works when the DLT pipeline's ownership is changed to me, but scheduled jobs cannot run on personal clusters.
We want all the jobs and DLT pipelines to run under the service principal, and all the jobs to use a job cluster (job clusters are single user access mode clusters).
I reviewed the following limitations:
UC cluster mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html
Materialized view limitations: https://docs.databricks.com/en/sql/user/materialized-views.html#limitations
Questions:
1. Given that materialized views can only be queried on a shared access mode cluster, why does scenario 3 throw that error? Shouldn't it run successfully since the service principal is running the job in a shared cluster (in a scheduled job) where the service principal is the owner of the DLT pipeline that creates the materialized views?
2. What do each of these error messages mean, especially the error in scenario 3?
3. Is there a solution to this problem, other than hard-coding the schemas to exclude from the loop (as is already done for 'information_schema')? Note that the schemas containing materialized views also contain regular Delta tables, and those Delta tables still need to end up in the "db_tables" list so they get optimized and vacuumed.
4. Is there a way to exclude materialized views in my loop without DESCRIBE TABLE EXTENDED, i.e., some other way to check the type of a table? One idea I'm considering is sketched right below the questions.
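Regarding question 4: the idea would be to build the table list from the catalog's information_schema.tables view (which exposes a table_type column) instead of running DESCRIBE TABLE EXTENDED per table. This is an untested sketch, "my_catalog" is a placeholder for our actual catalog name, and I have not verified whether querying information_schema runs into the same pipeline-ownership restriction:

db_tables = [
    [row['table_schema'], row['table_name']]
    for row in spark.sql("""
        SELECT table_schema, table_name
        FROM my_catalog.information_schema.tables
        WHERE table_type NOT IN ('VIEW', 'MATERIALIZED_VIEW')
          AND table_schema != 'information_schema'
    """).collect()
]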
Here is the code:
db_tables = []
for database in spark.sql("SHOW DATABASES").collect():
    # Skip the 'information_schema' database
    if database['databaseName'].lower() != 'information_schema':
        for table in spark.sql(f"SHOW TABLES IN {database['databaseName']}").collect():
            # Skip temporary tables
            if not table['isTemporary']:
                # Use DESCRIBE TABLE EXTENDED to check whether it is a view or a table
                desc_table = spark.sql(f"DESCRIBE TABLE EXTENDED {database['databaseName']}.{table['tableName']}").collect()
                if any(row['col_name'] == 'Type' and row['data_type'] in ('VIEW', 'MATERIALIZED_VIEW') for row in desc_table):
                    continue  # Skip views and materialized views
                else:
                    db_tables.append([database['databaseName'], table['tableName']])
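For context, the part of the notebook that consumes db_tables is not shown above; roughly, it is equivalent to the sketch below (simplified, not the exact job code):

for db_name, table_name in db_tables:
    # OPTIMIZE and VACUUM each regular Delta table collected above
    spark.sql(f"OPTIMIZE {db_name}.{table_name}")
    spark.sql(f"VACUUM {db_name}.{table_name}")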