We have a daily job with a notebook that loops through all the databases and tables, and optimizes and vacuums them.
Since DLT tables in UC are materialized views, the OPTIMIZE and VACUUM commands do not work on them, so they need to be excluded. I added a line (highlighted in yellow) that checks the table type and skips the table if it is a view or a materialized view. However, the job throws an error before it ever gets there, on the DESCRIBE TABLE EXTENDED command. Apparently you cannot run DESCRIBE TABLE EXTENDED on materialized views (DLT tables), and the reason seems to be tied to the ownership of the DLT pipeline that creates the materialized views, the cluster access mode, and the principal who runs the code.
It seems like the principal running the job or notebook needs to be an owner of the DLT pipeline in order to run DESCRIBE TABLE EXTENDED on materialized views, and even then only on a shared access mode cluster or a personal cluster.
I will list the scenarios I tested out:
- Originally, the scheduled job was set up to run the workflow on a job cluster as the prod service principal. It throws "java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Your token is missing the required scopes for this endpoint." on the DESCRIBE TABLE EXTENDED command. At this point, all the DLT pipelines and jobs in our workspace have the prod service principal as owner and "run as".
- Ran the code in an interactive notebook in my own Databricks workspace on a shared cluster. It throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." on the DESCRIBE TABLE EXTENDED command. When I changed the pipeline ownership to myself, it succeeded.
- Changed the scheduled job's cluster from a job cluster to a shared cluster. It throws "org.apache.spark.SparkException: A pipeline with a different id is already registered in this Credential Scope. The pipeline in a CredentialScope cannot be updated once it is registered." on the DESCRIBE TABLE EXTENDED command. All the DLT pipelines and jobs still have the prod service principal as owner and "run as" at this point.
- Running the code on a personal cluster throws "com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of Pipeline '9cec423c-pipeline_id'." It works when the DLT pipeline's ownership is changed to me, but scheduled jobs cannot run on personal clusters.
We want all the jobs and DLT pipelines to run under the service principal, and all the jobs to use a job cluster (job clusters are single user access mode clusters).
I reviewed the following limitations:
UC cluster mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html
Materialized view limitations: https://docs.databricks.com/en/sql/user/materialized-views.html#limitations
Questions:
1. Given that materialized views can only be queried on a shared access mode cluster, why does scenario 3 throw that error? Shouldn't it run successfully since the service principal is running the job in a shared cluster (in a scheduled job) where the service principal is the owner of the DLT pipeline that creates the materialized views?
2. What do each of these error messages mean, especially the error in scenario 3?
3. Is there a solution to this problem, other than hard-coding the schemas to exclude from the loop (as is already done for 'information_schema')? Note that the schemas containing materialized views also contain regular Delta tables, and those Delta tables still need to end up in the "db_tables" list so they get optimized and vacuumed.
4. Is there a way to exclude materialized views in my loop without DESCRIBE TABLE EXTENDED, i.e., some other way to check the type of a table? One idea I'm considering is sketched right below the questions.
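Regarding question 4: the idea would be to build the table list from the catalog's information_schema.tables view (which exposes a table_type column) instead of running DESCRIBE TABLE EXTENDED per table. This is an untested sketch, "my_catalog" is a placeholder for our actual catalog name, and I have not verified whether querying information_schema runs into the same pipeline-ownership restriction:

db_tables = [
    [row['table_schema'], row['table_name']]
    for row in spark.sql("""
        SELECT table_schema, table_name
        FROM my_catalog.information_schema.tables
        WHERE table_type NOT IN ('VIEW', 'MATERIALIZED_VIEW')
          AND table_schema != 'information_schema'
    """).collect()
]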
Here is the code:
db_tables = []
for database in spark.sql("SHOW DATABASES").collect():
    # Skip the 'information_schema' database
    if database['databaseName'].lower() != 'information_schema':
        for table in spark.sql(f"SHOW TABLES IN {database['databaseName']}").collect():
            # Skip temporary tables
            if not table['isTemporary']:
                # Use DESCRIBE TABLE EXTENDED to check whether it is a view or a table
                desc_table = spark.sql(f"DESCRIBE TABLE EXTENDED {database['databaseName']}.{table['tableName']}").collect()
                if any(row['col_name'] == 'Type' and row['data_type'] in ('VIEW', 'MATERIALIZED_VIEW') for row in desc_table):
                    continue  # Skip views and materialized views
                else:
                    db_tables.append([database['databaseName'], table['tableName']])
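For context, the part of the notebook that consumes db_tables is not shown above; roughly, it is equivalent to the sketch below (simplified, not the exact job code):

for db_name, table_name in db_tables:
    # OPTIMIZE and VACUUM each regular Delta table collected above
    spark.sql(f"OPTIMIZE {db_name}.{table_name}")
    spark.sql(f"VACUUM {db_name}.{table_name}")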