01-24-2024 10:22 AM
We want all the jobs and DLT pipelines to run under the service principal, and all the jobs to use a job cluster (job clusters are single user access mode clusters).
I reviewed the limitations below:
UC cluster mode limitations: https://docs.databricks.com/en/compute/access-mode-limitations.html
Materialized view limitations: https://docs.databricks.com/en/sql/user/materialized-views.html#limitations
Questions:
1. Given that materialized views can only be queried on a shared access mode cluster, why does scenario 3 throw that error? Shouldn't it run successfully since the service principal is running the job in a shared cluster (in a scheduled job) where the service principal is the owner of the DLT pipeline that creates the materialized views?
2. What do each of these error messages mean, especially the error in scenario 3?
3. Is there a solution to this problem besides hard-coding the schemas to exclude from the loop (like 'information_schema')? Note that the schemas containing materialized views also hold regular Delta tables, and those Delta tables still need to end up in the "db_tables" list so they can be optimized and vacuumed.
4. Is there a way to exclude materialized views in my loop without running "describe table extended" (i.e., some other way to check a table's type)?
Here is the code:
db_tables = []
for database in spark.sql("SHOW DATABASES").collect():
    # Skip the 'information_schema' database
    if database['databaseName'].lower() != 'information_schema':
        for table in spark.sql(f"SHOW TABLES IN {database['databaseName']}").collect():
            # Skip temporary tables
            if not table['isTemporary']:
                # Use DESCRIBE TABLE EXTENDED to check whether it is a view or a table
                desc_table = spark.sql(f"DESCRIBE TABLE EXTENDED {database['databaseName']}.{table['tableName']}").collect()
                if any(row['col_name'] == 'Type' and row['data_type'] in ('VIEW', 'MATERIALIZED_VIEW') for row in desc_table):
                    continue  # Skip views and materialized views
                db_tables.append([database['databaseName'], table['tableName']])
11-26-2024 11:10 AM - edited 11-26-2024 11:11 AM
For this kind of table maintenance, Predictive Optimization is recommended.
If you do need to loop over all tables in the metastore, query system.information_schema.tables rather than issuing SHOW commands through spark.sql(). For this case in particular, you can filter on the table_type column to exclude materialized views.
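A minimal sketch of that recommendation, assuming the documented table_schema / table_name / table_type columns of system.information_schema.tables (the helper only assembles the SQL string; on a cluster you would pass it to spark.sql so the filtering happens server-side):

```python
def non_view_tables_query(excluded_schema: str = "information_schema") -> str:
    """Build one query against system.information_schema.tables that returns
    every table except views and materialized views, skipping the excluded
    schema. table_type is the column referred to above."""
    return (
        "SELECT table_schema, table_name "
        "FROM system.information_schema.tables "
        f"WHERE table_schema <> '{excluded_schema}' "
        "AND table_type NOT IN ('VIEW', 'MATERIALIZED_VIEW')"
    )

# On a Databricks cluster:
# db_tables = [[r['table_schema'], r['table_name']]
#              for r in spark.sql(non_view_tables_query()).collect()]
```

This replaces the nested SHOW DATABASES / SHOW TABLES / DESCRIBE loop with a single metastore query.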
12-08-2024 11:41 PM
db_tables = []
for database in spark.sql("SHOW DATABASES").collect():
    # Skip the 'information_schema' database
    if database['databaseName'].lower() != 'information_schema':
        for table in spark.sql(f"SHOW TABLES IN {database['databaseName']}").collect():
            # Skip temporary tables
            if not table['isTemporary']:
                # Use information_schema.tables to check the table's type
                # (table_type is not a column of information_schema.views)
                desc_table = spark.sql(
                    f"SELECT table_type FROM information_schema.tables "
                    f"WHERE table_schema = '{database['databaseName']}' "
                    f"AND table_name = '{table['tableName']}'"
                ).collect()
                if len(desc_table) > 0 and desc_table[0]['table_type'] in ('VIEW', 'MATERIALIZED_VIEW'):
                    continue  # Skip views and materialized views
                db_tables.append([database['databaseName'], table['tableName']])
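Once db_tables holds only real Delta tables, the maintenance pass the original question describes could be driven by a small generator like the sketch below (the statement shapes are assumptions on my part; VACUUM retention and error handling are left at their defaults):

```python
def maintenance_statements(db_tables):
    """Yield an OPTIMIZE and a VACUUM statement for each [schema, table]
    pair; backtick-quoting guards against unusual identifiers."""
    for schema, table in db_tables:
        fq_name = f"`{schema}`.`{table}`"
        yield f"OPTIMIZE {fq_name}"
        yield f"VACUUM {fq_name}"

# On a Databricks cluster:
# for stmt in maintenance_statements(db_tables):
#     spark.sql(stmt)
```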