07-08-2025 02:55 AM
Hello
Historically, we had a number of tables that were extracted from source and loaded into Databricks using 'select *'.
As a result, some of the columns that were loaded never get used.
I'd like to tidy this up and remove the redundant columns.
Is there a way I can script a check for whether the columns in a table are referenced (or, preferably, not referenced) in any downstream Databricks notebooks?
Thanks
07-08-2025 03:05 AM
Hey @dpc
I don't know if you can do it without writing a custom query parser... I don't have this exact script, but maybe this could help. The first step would be to find out which tables haven't been queried recently (in this case, in the last 90 days).
with used_tables as (
  select
    source_table_catalog,
    source_table_schema,
    source_table_name,
    count(distinct created_by) as downstream_users,
    count(*) as downstream_dependents
  from system.access.table_lineage
  where source_table_full_name is not null
    and event_time >= date_add(now(), -90)
  group by all
)
select
  tabelas.table_catalog,
  tabelas.table_schema,
  tabelas.table_name,
  tabelas.table_type,
  tabelas.table_owner,
  tabelas.`comment` as table_comment,
  tabelas.created as table_created_at,
  tabelas.created_by as table_created_by,
  tabelas.last_altered as table_last_update_at,
  tabelas.last_altered_by as table_last_altered_by
from system.information_schema.`tables` as tabelas
left join used_tables as ut
  on ut.source_table_catalog = tabelas.table_catalog
  and ut.source_table_schema = tabelas.table_schema
  and ut.source_table_name = tabelas.table_name
where ut.downstream_dependents is null
  and tabelas.table_catalog not in ('system', '__databricks_internal')
  and tabelas.table_schema != 'information_schema'
You could then go deeper into the SQL queries themselves, parsing each statement to see whether people used "*" or named specific columns.
SELECT
  l.source_table_full_name,
  l.entity_type,
  q.statement_text,
  q.executed_by,
  q.end_time
FROM system.access.table_lineage l
JOIN system.query.history q
  ON l.entity_run_id = q.statement_id
WHERE l.source_table_full_name = '< insert table name >'
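Going from the statement_text above to column usage without a full SQL parser is error-prone, but as a rough first pass you can do a naive whole-word scan. This is a sketch, not an exact script (referenced_columns is a hypothetical helper): it flags which candidate columns appear in a statement, but it will miss "SELECT *" and can be fooled by comments or string literals.

```python
import re

def referenced_columns(statement_text: str, candidate_columns: list[str]) -> set[str]:
    """Naive scan: return the candidate column names that appear as
    whole words in a SQL statement. A real solution needs a proper SQL
    parser, since this cannot distinguish columns from aliases,
    comments, or string literals."""
    found = set()
    for col in candidate_columns:
        # Word boundary that also treats backtick-quoted identifiers as matches.
        pattern = rf"(?<![\w`]){re.escape(col)}(?![\w`])"
        if re.search(pattern, statement_text, re.IGNORECASE):
            found.add(col)
    return found
```

Note that a statement like "SELECT * FROM sales" reports no columns at all, which is exactly the "select *" case the thread started with, so you would still need to treat wildcard queries separately.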
07-08-2025 04:06 AM
Thanks. I'll take a look
The other point here, though, is that it's mainly ETL notebooks that access the landed data, i.e. the silver layer.
So it's really a case of establishing whether a landed column has been used in the creation of a reporting table.
If not, we don't need to extract that column.
07-08-2025 04:28 AM
Are you using Unity Catalog? If so, you could leverage its automatic lineage. It shows everything downstream of the silver layer, so you could easily check which columns are being used and trace the lineage.
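With system tables enabled, the automatic lineage can also be queried at column granularity rather than one table at a time in the UI. A sketch, assuming the system.access.column_lineage system table is enabled in your workspace ('<catalog.schema.silver_table>' is a placeholder):

```sql
select
  source_column_name,
  count(distinct target_table_full_name) as downstream_tables,
  max(event_time) as last_used_at
from system.access.column_lineage
where source_table_full_name = '<catalog.schema.silver_table>'
  and event_time >= date_add(now(), -90)
group by source_column_name
```

Anti-joining this result against system.information_schema.columns for the same table would then list the columns with no recorded downstream column-level usage, i.e. the removal candidates.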
07-08-2025 05:47 AM
Thanks
Would I have to do that one table at a time, though?
Lineage is useful, but it only shows which tables used the table; it doesn't actually show the columns used unless you go into the notebook. Unless I am missing something here?