Pushdown in Postgres

Brad
Contributor

Hi team,

In Databricks I need to query a Postgres source, something like

select * from postgres_tbl where id in (select id from df)

where df comes from a Hive table. If I use the JDBC driver and do

query = '(select * from postgres_tbl) as t'
src_df = spark.read.format("postgresql").option("dbtable", query)....

and then join src_df with df, nothing seems to get pushed down to the Postgres query. I know I can filter src_df, or build a SQL string by converting df's ids into a comma-separated list, but if df has many rows the query becomes very long. Is there a good way to get the filter pushed down, or to run a federated query efficiently?
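
For illustration, this is roughly what I mean by converting df into an id string list and inlining it into the query (the table name and connection options below are just placeholders, and I'm assuming numeric ids):

# collect the ids from the Hive-backed df into a comma-separated string
id_list = ",".join(str(r["id"]) for r in df.select("id").distinct().collect())

# inline the ids into the pushed-down subquery (gets very long when df has many rows)
query = f"(select * from postgres_tbl where id in ({id_list})) as t"
src_df = (spark.read
  .format("postgresql")
  .option("dbtable", query)
  .option("host", "database_hostname")
  .option("port", "5432")
  .option("database", "database_name")
  .option("user", "username")
  .option("password", "password")
  .load())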

Thanks

Brad

2 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @Brad 

Instead of passing a query, you can read the Postgres table directly and then filter the DataFrame on the relevant column; that filter also gets pushed down to the source.

Ex - 

from pyspark.sql.functions import col

# id_list is a Python list of the ids you want to keep (see below for one way to build it from df)
remote_table = (spark.read
  .format("postgresql")
  .option("dbtable", "schema_name.table_name")  # if schema_name is not provided, it defaults to "public"
  .option("host", "database_hostname")
  .option("port", "5432")  # optional - defaults to 5432 if not specified
  .option("database", "database_name")
  .option("user", "username")
  .option("password", "password")
  .load()
).filter(col("id").isin(id_list))
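
A minimal sketch of how the id_list above could be built from the Hive-backed df in the question (the Hive table name is a placeholder), plus a quick way to check that the filter actually reaches Postgres:

# collect the ids from the Hive table into a Python list before the read above
df = spark.table("hive_db.df")  # placeholder name for the Hive table behind df
id_list = [row["id"] for row in df.select("id").distinct().collect()]

# after loading and filtering remote_table, inspect the physical plan;
# the IN condition should show up under PushedFilters in the scan node
remote_table.explain()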

Brad
Contributor

Thanks for the response. I can't do that, because we load from the source incrementally and very frequently; we can't read the full data each time.
