Re: Read just the new file ???

William_Scardua · ‎09-23-2021

Thank you for your feedback, @Ryan Chynoweth.

For example, imagine this situation:

time1- Some CSV files land in my HDFS landing directory (landing/file1.csv, landing/file2.csv)

time2- My batch PySpark job reads the HDFS landing directory and writes to the HDFS bronze directory (bronze/)

time3- New CSV files arrive in the HDFS landing directory (landing/file3.csv, landing/file4.csv)

time4- At this point the batch PySpark job needs to read only the new files (landing/file3.csv, landing/file4.csv) and append them to the HDFS bronze directory (bronze/)

In a stream (writeStream) there is the 'checkpointLocation' option, but what about in a batch job? Do I need to develop my own Python control for this situation?
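To make the question concrete, this is roughly the kind of Python control I have in mind. It is only a minimal sketch under some assumptions: the list of files in landing/ is obtained elsewhere and passed in, the record of already-loaded files is kept in a small JSON manifest on the driver's local filesystem, and all the names here (processed_files.json, run_batch, load_fn, and so on) are hypothetical, not part of Spark.

```python
import json
import os
import tempfile


def load_processed(manifest_path):
    """Return the set of file paths recorded in the manifest (empty if none yet)."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path) as f:
        return set(json.load(f))


def save_processed(manifest_path, processed):
    """Write the manifest to a temp file, then rename it over the old one."""
    directory = os.path.dirname(manifest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(processed), f)
    os.replace(tmp_path, manifest_path)


def select_new_files(landing_files, processed):
    """Return the landing files not yet recorded as processed, sorted."""
    return sorted(set(landing_files) - set(processed))


def run_batch(landing_files, manifest_path, load_fn):
    """Load only the unseen files with load_fn, then record them in the manifest."""
    processed = load_processed(manifest_path)
    new_files = select_new_files(landing_files, processed)
    if new_files:
        # load_fn is supplied by the caller, e.g. read the CSVs and append to bronze/.
        load_fn(new_files)
        # Record the files only after the load succeeded, so a failed run is retried.
        save_processed(manifest_path, processed | set(new_files))
    return new_files
```

In the PySpark job, load_fn could be something like `lambda paths: spark.read.csv(paths).write.mode("append").csv("bronze/")`, since spark.read.csv accepts a list of paths; the output format for bronze/ is an assumption on my side. At time2 the call would load file1.csv and file2.csv, and at time4 it would load only file3.csv and file4.csv.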

Does that make sense?

Thanks