09-23-2021 10:24 AM
Thank you for your feedback, @Ryan Chynoweth.
For example, imagine this situation:
time1- Some CSV files land in my HDFS landing directory (landing/file1.csv, landing/file2.csv).
time2- My PySpark batch job reads the HDFS landing directory and writes to the HDFS bronze directory (bronze/).
time3- New CSV files arrive in the HDFS landing directory (landing/file3.csv, landing/file4.csv).
time4- At this point the PySpark batch job needs to read only the new files (landing/file3.csv, landing/file4.csv) and append them to the HDFS bronze directory (bronze/).
In a stream (writeStream) there is the 'checkpointLocation' option, but what about in a batch job? Do I need to develop a Python control for this situation?
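To make the question concrete, here is a minimal sketch of the kind of Python control I have in mind. It is only an illustration under assumptions: the function names (load_processed, select_new_files, save_processed, run_batch) and the JSON manifest file are hypothetical, the manifest is kept on a local path rather than in HDFS, and the process callback stands in for the actual PySpark read and append.

```python
import json
from pathlib import Path


def load_processed(manifest_path):
    """Return the set of file paths recorded by earlier batch runs."""
    path = Path(manifest_path)
    if not path.exists():
        return set()  # first run: nothing has been processed yet
    return set(json.loads(path.read_text()))


def select_new_files(landing_files, processed):
    """Return the landing files that no earlier run has recorded, sorted."""
    return sorted(set(landing_files) - set(processed))


def save_processed(manifest_path, processed):
    """Persist the updated set of processed file paths as a JSON list."""
    Path(manifest_path).write_text(json.dumps(sorted(processed)))


def run_batch(landing_files, manifest_path, process):
    """Process only the unseen files, then record them in the manifest.

    `process` is a hypothetical callback; in a real job it would read the
    given paths with PySpark and append the result to bronze/.
    """
    processed = load_processed(manifest_path)
    new_files = select_new_files(landing_files, processed)
    if new_files:
        process(new_files)
        # Update the manifest only after processing succeeded, so a failed
        # run is retried on the next execution.
        save_processed(manifest_path, processed | set(new_files))
    return new_files
```

With this, the time2 run would pick up file1.csv and file2.csv, and the time4 run would pick up only file3.csv and file4.csv. What I would like to know is whether batch mode has something built in that replaces this manual bookkeeping.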
Does that make sense?
Thanks