05-25-2023 02:57 AM
Whenever I try to load multiple files into a single DataFrame for processing (the combined size is more than 15 GB by the end of the loop), my code crashes every time with the error below:
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
Please help me fix it. Below is my code:
import os
import pandas as pd

# k and start_date (a 'YYYYMMDD' string) are defined elsewhere, not shown
df2 = pd.DataFrame()
for i in range(0, k):
    df1 = pd.DataFrame()
    for j in pd.date_range(start_date, periods=5):
        print(i, start_date)
        path = r'/dbfs/mnt/xxxx/***/Ixxxx/***/'
        path1 = os.path.join(path, 'XXXX_' + start_date + '.csv')
        if os.path.isfile(path1):
            df = pd.read_csv(path1, low_memory=False)
            df = df.drop(['Var1', 'Var2', 'Var3'], axis=1)
            df = df.drop_duplicates(keep='first')
            df.reset_index(drop=True, inplace=True)
            df.set_index('VmsNo', inplace=True)
            df1 = df1.append(df)
        # step back one day for the next iteration
        start_date = (pd.Timestamp(start_date) - pd.DateOffset(days=1)).strftime('%Y%m%d')
    df2 = df2.append(df1)
05-26-2023 03:27 AM
@Satish Agarwal It seems your system memory is not sufficient to hold 15 GB of data. I believe you are loading it into a pandas DataFrame rather than using Spark. Is there any particular reason you cannot use Spark for this?
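For illustration, here is a minimal sketch of what the Spark route might look like. It is not tested against your data. It assumes the files follow the 'XXXX_<YYYYMMDD>.csv' naming from your code, that each CSV has a header row, and that `spark` is the SparkSession a Databricks notebook provides. The helper names (candidate_paths, to_spark_path, load_with_spark) are made up for this example. Note that dropDuplicates() here removes duplicates across all files combined, whereas your loop de-duplicates each file separately, so adjust if that difference matters.

```python
import os
from datetime import datetime, timedelta


def candidate_paths(base_dir, start_date, n_days, prefix="XXXX_"):
    """Return the existing CSV paths for n_days dates, walking backwards
    from start_date (a 'YYYYMMDD' string), like the loop in the question."""
    start = datetime.strptime(start_date, "%Y%m%d")
    paths = []
    for offset in range(n_days):
        day = (start - timedelta(days=offset)).strftime("%Y%m%d")
        path = os.path.join(base_dir, f"{prefix}{day}.csv")
        if os.path.isfile(path):
            paths.append(path)
    return paths


def to_spark_path(local_path):
    """Map the local /dbfs/... view of a file to the dbfs:/... form
    that Spark readers use on Databricks; leave other paths untouched."""
    if local_path.startswith("/dbfs/"):
        return "dbfs:/" + local_path[len("/dbfs/"):]
    return local_path


def load_with_spark(spark, paths):
    """Read all files in one distributed pass instead of appending
    pandas DataFrames in driver memory."""
    if not paths:
        raise ValueError("no input files found")
    sdf = spark.read.csv(
        [to_spark_path(p) for p in paths], header=True, inferSchema=True
    )
    # Same column drops as the pandas version.
    return sdf.drop("Var1", "Var2", "Var3").dropDuplicates()
```

In a notebook this could be called as load_with_spark(spark, candidate_paths(path, start_date, 5 * k)), where path, start_date and k are the values from your own code. The result is a Spark DataFrame, so the data stays distributed across the cluster rather than being collected into the Python process that is running out of memory.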