- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-18-2023 07:44 AM
Hi @Kaniz Fatma , @Daniel Sahal -
Few updates from my side.
After so many hits and trials , psycopg2 worked out in my case.
We can process 200+GB data with 10 node cluster (n2-highmem-4,32 GB Memory, 4 Cores) and driver 32 GB Memory, 4 Cores with Runtime10.4.x-scala2.12
It took close to 100 min to load whole data into PG .
Though we need to change parquet to csv and read csv sequentially to load into PG
like the below code snippet -
import psycopg2
# Create a connection to the PostgreSQL database
con = psycopg2.connect(database="dbname",user="user-nm",password="password",host="ip",port="5432")
# Create a cursor object
cur = con.cursor()
# Open the CSV file
files_to_load = dbutils.fs.ls("dbfs:/dir/")
while files_to_load:
file_path = files_to_load.pop(0).path
if file_path.endswith('.csv'):
file_path = file_path.replace("dbfs:","/dbfs")
print(file_path)
with open(file_path, "r") as f:
# Use the copy_from function to load the data into the PostgreSQL table
cur.copy_from(f, "pg_table_name", sep=",")
# Commit the changes
con.commit()
# Close the cursor and connection
con.close()
Few questions if you can assist
1- Is there any way we can tweak the code to read parquet data instead .csv data in above code (we have data originally in parquet format (databricks delta table) and if we can use same native format , we can cut down extra processing )
2- Is there any way we can run parallelly instead sequentially .
3- Any other tips which can boost the performance and take less time .
Regards,