Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Can I see whether a table column is read in databricks?

dpc
New Contributor III

Hello

Historically, we had a number of tables that were extracted from source and loaded into Databricks using 'select *'.

As a result, some of the loaded columns never get used.

I'd like to tidy this up and remove the redundant columns.

 

Is there a way I can script whether columns in a table are referenced (or preferably not) in any downstream databricks notebooks?

 

Thanks

 

4 REPLIES

mhiltner
Databricks Employee

Hey @dpc 

I don't know if you can do it without customizing a query parser. I don't have this exact script, but maybe this can help. The first step would be to find out which tables haven't been queried recently (in this case, over the past 90 days).

 

with used_tables as (
  -- tables with at least one recorded downstream read in the last 90 days
  select
    source_table_catalog,
    source_table_schema,
    source_table_name,
    count(distinct created_by) as downstream_users,
    count(*) as downstream_dependents
  from
    system.access.table_lineage
  where
    source_table_full_name is not null
    and event_time >= date_add(now(), -90)
  group by
    all
)
select
  tabelas.table_catalog,
  tabelas.table_schema,
  tabelas.table_name,
  tabelas.table_type,
  tabelas.table_owner,
  tabelas.`comment` as table_comment,
  tabelas.created as table_created_at,
  tabelas.created_by as table_created_by,
  tabelas.last_altered as table_last_update_at,
  tabelas.last_altered_by as table_last_altered_by
from
  system.information_schema.`tables` as tabelas
  left join used_tables as ut
    on  ut.source_table_catalog = tabelas.table_catalog
    and ut.source_table_schema = tabelas.table_schema
    and ut.source_table_name = tabelas.table_name
where
  -- keep only tables with no recorded downstream reads
  ut.downstream_dependents is null
  and tabelas.table_catalog != 'system'
  and tabelas.table_catalog != '__databricks_internal'
  and tabelas.table_schema != 'information_schema'
 You could then go deeper into the SQL queries, parsing each statement to see whether people are selecting * or referencing specific columns.

SELECT
  l.source_table_full_name,
  l.entity_type,
  q.statement_text,
  q.executed_by,
  q.end_time
FROM system.access.table_lineage AS l
JOIN system.query.history AS q
  ON l.entity_run_id = q.statement_id
WHERE l.source_table_full_name = '< insert table name >'
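As a rough first pass over those statements, you could also filter the query history with plain string matching rather than a real parser. This is only a sketch: `my_table` and `my_column` are placeholders, and an ILIKE match will produce false positives when another table shares a column name, so treat an empty result as a removal candidate, not proof.

```sql
-- Heuristic sketch: recent statements that mention a given column at all.
-- String matching, not parsing; expect false positives from same-named
-- columns in other tables.
SELECT
  q.statement_id,
  q.executed_by,
  q.end_time,
  q.statement_text
FROM system.query.history AS q
WHERE q.end_time >= date_add(now(), -90)
  AND q.statement_text ILIKE '%my_table%'    -- placeholder: your table
  AND (q.statement_text ILIKE '%my_column%'  -- placeholder: the column
       OR q.statement_text LIKE '%*%')       -- a SELECT * also reads it
```

Cross-referencing the statement IDs here with the lineage join above narrows the matches to statements that actually touched the table.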

 

dpc
New Contributor III

Thanks. I'll take a look

The other point here, though, is that it's mainly ETL notebooks that access the landed data, i.e. the silver layer.

So it's really a case of establishing whether a landed column has been used in the creation of a reporting table.

If not, we don't need to extract the column.

mhiltner
Databricks Employee

Are you using Unity Catalog? If so, you could leverage its automatic lineage here. It shows everything downstream of the silver layer, so you could easily check which columns are being used and trace the lineage.
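If your workspace also captures column-level lineage in the system tables, something along these lines could cover the whole silver layer in one pass. This is a sketch only: it assumes the system.access.column_lineage table is enabled in your workspace, and `my_catalog` / `silver` are placeholders for your catalog and schema.

```sql
-- Sketch, assuming system.access.column_lineage is available:
-- silver columns with no recorded downstream column-level reads
-- in the last 90 days are candidates for removal.
SELECT
  c.table_name,
  c.column_name
FROM system.information_schema.columns AS c
LEFT JOIN system.access.column_lineage AS cl
  ON  cl.source_table_full_name =
        c.table_catalog || '.' || c.table_schema || '.' || c.table_name
  AND cl.source_column_name = c.column_name
  AND cl.event_time >= date_add(now(), -90)
WHERE c.table_catalog = 'my_catalog'  -- placeholder catalog
  AND c.table_schema = 'silver'       -- placeholder schema
  AND cl.source_column_name IS NULL
```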

dpc
New Contributor III

Thanks

Would I have to do that one table at a time, though?

Lineage is useful, but it only shows which tables used the table. It doesn't actually show the columns used unless you go into the notebook. Unless I'm missing something here?

 
