Pushdown in Postgres

Brad
Contributor

Hi team,

In Databricks I need to query a Postgres source, something like

select * from postgres_tbl where id in (select id from df)

where df comes from a Hive table. If I use the JDBC driver and do

query = '(select * from postgres_tbl) as t'
src_df = spark.read.format("postgresql").option("dbtable", query)....

and then join src_df with df, nothing seems to get pushed down to the Postgres query. I know I can filter src_df, or build a SQL string by converting df's ids into a comma-separated list, but if df has many rows the query becomes very long. Is there a good way to get the filter pushed down, or to run a federated query efficiently?
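
For illustration, this is roughly what I mean by converting df into an id string list and inlining it into the query (the table name and connection options below are just placeholders, and I'm assuming numeric ids):

# collect the ids from the Hive-backed df into a comma-separated string
id_list = ",".join(str(r["id"]) for r in df.select("id").distinct().collect())

# inline the ids into the pushed-down subquery (gets very long when df has many rows)
query = f"(select * from postgres_tbl where id in ({id_list})) as t"
src_df = (spark.read
  .format("postgresql")
  .option("dbtable", query)
  .option("host", "database_hostname")
  .option("port", "5432")
  .option("database", "database_name")
  .option("user", "username")
  .option("password", "password")
  .load())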

Thanks

Brad

2 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @Brad 

Instead of passing a query, you can read the Postgres table directly and then filter the DataFrame on the relevant column; that filter also gets pushed down to the source.

Ex - 

from pyspark.sql.functions import col

# id_list is a Python list of the ids you want to keep (see below for one way to build it from df)
remote_table = (spark.read
  .format("postgresql")
  .option("dbtable", "schema_name.table_name")  # if schema_name is not provided, it defaults to "public"
  .option("host", "database_hostname")
  .option("port", "5432")  # optional - defaults to 5432 if not specified
  .option("database", "database_name")
  .option("user", "username")
  .option("password", "password")
  .load()
).filter(col("id").isin(id_list))
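
A minimal sketch of how the id_list above could be built from the Hive-backed df in the question (the Hive table name is a placeholder), plus a quick way to check that the filter actually reaches Postgres:

# collect the ids from the Hive table into a Python list before the read above
df = spark.table("hive_db.df")  # placeholder name for the Hive table behind df
id_list = [row["id"] for row in df.select("id").distinct().collect()]

# after loading and filtering remote_table, inspect the physical plan;
# the IN condition should show up under PushedFilters in the scan node
remote_table.explain()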

Brad
Contributor

Thanks for the response. I can't do that, because we load from the source incrementally and very frequently; we can't read the full data each time.
