<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pushdown in Postgres in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67141#M33291</link>
    <description>&lt;P&gt;Thanks for the response. I cannot do that, as we load incrementally from the source very frequently. We cannot read the full data each time.&lt;/P&gt;</description>
    <pubDate>Wed, 24 Apr 2024 04:14:19 GMT</pubDate>
    <dc:creator>MikeGo</dc:creator>
    <dc:date>2024-04-24T04:14:19Z</dc:date>
    <item>
      <title>Pushdown in Postgres</title>
      <link>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67126#M33287</link>
      <description>&lt;P&gt;Hi team,&lt;/P&gt;&lt;P&gt;In Databricks I need to query a Postgres source like&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;select * from postgres_tbl where id in (select id from df)&lt;/LI-CODE&gt;&lt;P&gt;where df comes from a Hive table. If I use the JDBC driver and do&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;query = '(select * from postgres_tbl) as t'
src_df = spark.read.format("postgresql").option("dbtable", query)....&lt;/LI-CODE&gt;&lt;P&gt;and then join src_df with df, there seems to be no pushdown to the Postgres query.&amp;nbsp;&lt;BR /&gt;I know I can build a SQL filter string by converting df to a list of id strings, but if df has many rows the query becomes very long. Is there a good way to get pushdown, or to run a federated query efficiently?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Brad&lt;/P&gt;</description>
      <pubDate>Tue, 23 Apr 2024 22:23:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67126#M33287</guid>
      <dc:creator>MikeGo</dc:creator>
      <dc:date>2024-04-23T22:23:24Z</dc:date>
    </item>
    <item>
      <title>Re: Pushdown in Postgres</title>
      <link>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67134#M33289</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100643"&gt;@MikeGo&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Instead of passing the query, you can read the Postgres table and then filter the DataFrame on the relevant column; this filter is also pushed down.&lt;/P&gt;&lt;P&gt;Ex -&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col

remote_table = (spark.read
  .format("postgresql")
  .option("dbtable", "schema_name.table_name") # if schema_name is not provided, defaults to "public"
  .option("host", "database_hostname")
  .option("port", "5432") # optional - defaults to 5432 if not specified
  .option("database", "database_name")
  .option("user", "username")
  .option("password", "password")
  .load()
).filter(col("id").isin(id_list)) # id_list: the ids collected from df&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 24 Apr 2024 02:42:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67134#M33289</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2024-04-24T02:42:28Z</dc:date>
    </item>
    <item>
      <title>Re: Pushdown in Postgres</title>
      <link>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67141#M33291</link>
      <description>&lt;P&gt;Thanks for response. I cannot do that as we incrementally loading from source very frequently. We cannot read full data each time.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Apr 2024 04:14:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pushdown-in-postgres/m-p/67141#M33291</guid>
      <dc:creator>MikeGo</dc:creator>
      <dc:date>2024-04-24T04:14:19Z</dc:date>
    </item>
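Since reading the full table is ruled out, one common workaround (not from the thread; a sketch under assumptions) is to chunk the ids from df into batches and push each batch down as its own IN-list query, keeping every individual query short. The helper below is hypothetical: `build_in_batches`, the table name `postgres_tbl`, and the column name `id` are illustrative placeholders.

```python
# Hypothetical helper: split a (possibly large) id list into batched
# "where id in (...)" queries so each pushed-down query stays short.
def build_in_batches(ids, batch_size=1000, table="postgres_tbl", column="id"):
    """Return one SQL query string per batch of ids."""
    queries = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # int() coercion keeps the interpolated values numeric-only
        in_list = ", ".join(str(int(i)) for i in batch)
        queries.append(f"select * from {table} where {column} in ({in_list})")
    return queries
```

Each resulting string could then be wrapped as a subquery, e.g. `f"({q}) as t"`, passed to the reader's `dbtable` option as in the original snippet, and the per-batch DataFrames unioned; whether this beats a local join depends on batch size and network cost.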
  </channel>
</rss>

