Re: Getting error while loading parquet data into ...

explorer · ‎01-18-2023

Hi @Kaniz Fatma , @Daniel Sahal -

Few updates from my side.

After so many hits and trials , psycopg2 worked out in my case.

We can process 200+GB data with 10 node cluster (n2-highmem-4,32 GB Memory, 4 Cores) and driver 32 GB Memory, 4 Cores with Runtime10.4.x-scala2.12

It took close to 100 min to load whole data into PG .

Though we need to change parquet to csv and read csv sequentially to load into PG

like the below code snippet -

import psycopg2

# Create a connection to the PostgreSQL database

con = psycopg2.connect(database="dbname",user="user-nm",password="password",host="ip",port="5432")

# Create a cursor object

cur = con.cursor()

# Open the CSV file

files_to_load = dbutils.fs.ls("dbfs:/dir/")

while files_to_load:

file_path = files_to_load.pop(0).path

if file_path.endswith('.csv'):

file_path = file_path.replace("dbfs:","/dbfs")

print(file_path)

with open(file_path, "r") as f:

# Use the copy_from function to load the data into the PostgreSQL table

cur.copy_from(f, "pg_table_name", sep=",")

# Commit the changes

con.commit()

# Close the cursor and connection

con.close()

Few questions if you can assist

1- Is there any way we can tweak the code to read parquet data instead .csv data in above code (we have data originally in parquet format (databricks delta table) and if we can use same native format , we can cut down extra processing )

2- Is there any way we can run parallelly instead sequentially .

3- Any other tips which can boost the performance and take less time .

Regards,