07-08-2025 02:55 AM
Hello
Historically, we had a number of tables that were extracted from source and loaded into Databricks using 'select *'.
As a result, some of the columns that were loaded never get used.
I'd like to tidy this up and remove the redundant columns.
Is there a way I can script a check for whether the columns in a table are referenced (or, preferably, not referenced) in any downstream Databricks notebooks?
Thanks
07-08-2025 03:05 AM
Hey @dpc
I don't know if you can do it without writing a custom query parser... I don't have this exact script, but maybe this could help. The first step would be to find out which tables haven't been queried recently (in this case, in the last 90 days).
with used_tables as (
  select
    source_table_catalog,
    source_table_schema,
    source_table_name,
    count(distinct created_by) as downstream_users,
    count(*) as downstream_dependents
  from system.access.table_lineage
  where source_table_full_name is not null
    and event_time >= date_add(now(), -90)
  group by all
)
select
  tabelas.table_catalog,
  tabelas.table_schema,
  tabelas.table_name,
  tabelas.table_type,
  tabelas.table_owner,
  tabelas.`comment` as table_comment,
  tabelas.created as table_created_at,
  tabelas.created_by as table_created_by,
  tabelas.last_altered as table_last_update_at,
  tabelas.last_altered_by as table_last_altered_by
from system.information_schema.`tables` as tabelas
left join used_tables as ut
  on ut.source_table_catalog = tabelas.table_catalog
  and ut.source_table_schema = tabelas.table_schema
  and ut.source_table_name = tabelas.table_name
where ut.downstream_dependents is null
  and tabelas.table_catalog not in ('system', '__databricks_internal')
  and tabelas.table_schema != 'information_schema'
You could then go deeper into the SQL queries themselves, parsing each statement to see whether people used "*" or named specific columns.
SELECT
  l.source_table_full_name,
  l.entity_type,
  q.statement_text,
  q.executed_by,
  q.end_time
FROM system.access.table_lineage l
JOIN system.query.history q
  ON l.entity_run_id = q.statement_id
WHERE l.source_table_full_name = '< insert table name >'
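Going from the statement_text above to column usage without a full SQL parser is error-prone, but as a rough first pass you can do a naive whole-word scan. This is a sketch, not an exact script (referenced_columns is a hypothetical helper): it flags which candidate columns appear in a statement, but it will miss "SELECT *" and can be fooled by comments or string literals.

```python
import re

def referenced_columns(statement_text: str, candidate_columns: list[str]) -> set[str]:
    """Naive scan: return the candidate column names that appear as
    whole words in a SQL statement. A real solution needs a proper SQL
    parser, since this cannot distinguish columns from aliases,
    comments, or string literals."""
    found = set()
    for col in candidate_columns:
        # Word boundary that also treats backtick-quoted identifiers as matches.
        pattern = rf"(?<![\w`]){re.escape(col)}(?![\w`])"
        if re.search(pattern, statement_text, re.IGNORECASE):
            found.add(col)
    return found
```

Note that a statement like "SELECT * FROM sales" reports no columns at all, which is exactly the "select *" case the thread started with, so you would still need to treat wildcard queries separately.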
07-08-2025 04:06 AM
Thanks. I'll take a look
The other point here, though, is that it's mainly ETL notebooks that access the landed data, i.e. the silver layer.
So it's really a case of establishing whether a landed column has been used in the creation of a reporting table.
If not, we don't need to extract that column.
07-08-2025 04:28 AM
Are you using Unity Catalog? If so, you could leverage its automatic lineage. It shows everything downstream of the silver layer, so you could easily check which columns are being used and trace the lineage.
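With system tables enabled, the automatic lineage can also be queried at column granularity rather than one table at a time in the UI. A sketch, assuming the system.access.column_lineage system table is enabled in your workspace ('<catalog.schema.silver_table>' is a placeholder):

```sql
select
  source_column_name,
  count(distinct target_table_full_name) as downstream_tables,
  max(event_time) as last_used_at
from system.access.column_lineage
where source_table_full_name = '<catalog.schema.silver_table>'
  and event_time >= date_add(now(), -90)
group by source_column_name
```

Anti-joining this result against system.information_schema.columns for the same table would then list the columns with no recorded downstream column-level usage, i.e. the removal candidates.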
07-08-2025 05:47 AM
Thanks
Would I have to do that one table at a time, though?
Lineage is useful, but it only shows which tables used the table; it doesn't actually show the columns used unless you go into the notebook. Unless I am missing something here?