Hey @dpc
I don't know if you can do this without writing a custom query parser... I don't have this exact script, but maybe this helps. The first step would be to find the tables that haven't been queried recently (here, in the last 90 days).
with used_tables as (
select
source_table_catalog,
source_table_schema,
source_table_name,
count(distinct created_by) as downstream_users,
count(*) as downstream_dependents
from
system.access.table_lineage
where
true
and source_table_full_name is not null
and event_time >= date_add(now(), -90)
group by
all
)
select
tabelas.table_catalog
,tabelas.table_schema
,tabelas.table_name
,tabelas.table_type
,tabelas.table_owner
,tabelas.`comment` as table_comment
,tabelas.created as table_created_at
,tabelas.created_by as table_created_by
,tabelas.last_altered as table_last_update_at
,tabelas.last_altered_by as table_last_altered_by
from
system.information_schema.`tables` as tabelas
left join used_tables as ut
  on ut.source_table_catalog = tabelas.table_catalog
  and ut.source_table_schema = tabelas.table_schema
  and ut.source_table_name = tabelas.table_name
where
true
and ut.downstream_dependents is null
and tabelas.table_catalog != 'system'
and tabelas.table_catalog != '__databricks_internal'
and tabelas.table_schema != 'information_schema'
You could then go deeper into the SQL statements themselves, parsing each one to see whether people are selecting with "*" or referencing specific columns.
SELECT
l.source_table_full_name,
l.entity_type,
q.statement_text,
q.executed_by,
q.end_time
FROM system.access.table_lineage l
JOIN system.query.history q
ON l.entity_run_id = q.statement_id
WHERE l.source_table_full_name = '< insert table name >'
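Once you've pulled `statement_text` out of `system.query.history`, a quick first pass doesn't need a full parser. Here's a rough heuristic sketch in Python (the regex and function name are my own, not anything Databricks ships); it flags `select *` and `alias.*` while ignoring `count(*)`. For anything serious you'd want a real SQL parser like sqlglot, since `*` can show up in plenty of other contexts.

```python
import re

def uses_select_star(statement_text: str) -> bool:
    """Rough heuristic: does the statement select all columns with '*'?

    This only looks for a '*' directly in a SELECT list; a real check
    would need a SQL parser, since '*' also appears in expressions.
    """
    # Neutralize count(*) so it doesn't trigger a false positive.
    cleaned = re.sub(r"count\s*\(\s*\*\s*\)", "count()",
                     statement_text, flags=re.IGNORECASE)
    # Match 'select *', 'select distinct *', or 'select alias.*'.
    return re.search(r"\bselect\s+(distinct\s+)?(\w+\.)?\*",
                     cleaned, flags=re.IGNORECASE) is not None

print(uses_select_star("SELECT * FROM sales.orders"))         # True
print(uses_select_star("SELECT o.* FROM sales.orders o"))     # True
print(uses_select_star("SELECT count(*) FROM sales.orders"))  # False
print(uses_select_star("SELECT id, total FROM sales.orders")) # False
```

From there you could group by `executed_by` and see which consumers actually depend on specific columns before dropping or narrowing a table.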