Pushdown in Postgres
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-23-2024 03:23 PM
Hi team,
In Databricks I need to query a postgres source like
select * from postgres_tbl where id in (select id from df)the df is got from a hive table. If I use JDBC driver, and do
query = '(select * from postgres_tbl) as t'
src_df = spark.read.format("postgresql").option("dbtable", query)....if I join src_df with df, seems no pushdown to postgres query.
I know I can filter or get a sql string by convert df to id string list, but if df has many rows, the query is going to be very long. Is there a good way to make it pushdown or do federated query efficiently?
Thanks
Brad
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-23-2024 07:42 PM
Hi @MikeGo
Instead of passing the query you can read the postgres table and after that you can filter the dataframe with respective column this also use pushdown filter.
Ex -
remote_table = (spark.read
.format("postgresql")
.option("dbtable", "schema_name.table_name") # if schema_name not provided, default to "public".
.option("host", "database_hostname")
.option("port", "5432") # Optional - will use default port 5432 if not specified.
.option("database", "database_name")
.option("user", "username")
.option("password", "password")
.load()
).filter(col("id").isin(list)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-23-2024 09:14 PM
Thanks for response. I cannot do that as we incrementally loading from source very frequently. We cannot read full data each time.