Hi,
I need to process nearly 30 files from different locations and insert the records into RDS.
I am using multi-threading to process these files in parallel, like below:
from multiprocessing.pool import ThreadPool

def process_files(file_path):
    # 1. Find bad records based on field validation
    # 2. Find good records based on field validation
    # 3. Insert only the good records into RDS
    # 4. Write good records to a COMPLETED folder and bad records to an ERROR
    #    folder, both created in the same location as the original file
    ...

pool = ThreadPool(len(files_list))
pool.map(process_files, files_list)
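For context, here is a fuller sketch of what process_files does today, assuming CSV input; validate_record and REQUIRED_FIELDS are simplified placeholders for our real field validation, and the RDS insert is elided:

import csv
import os

REQUIRED_FIELDS = ["id", "name"]  # placeholder; the real validation is richer

def validate_record(record):
    # A record is "good" only if every required field is non-empty.
    return all(record.get(field) for field in REQUIRED_FIELDS)

def process_files(file_path):
    good, bad = [], []
    with open(file_path, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        for record in reader:
            (good if validate_record(record) else bad).append(record)

    # ... insert the good records into RDS here (elided in this sketch) ...

    # Write the splits next to the original file, per step 4 above.
    base, name = os.path.split(file_path)
    for folder, records in (("COMPLETED", good), ("ERROR", bad)):
        out_dir = os.path.join(base, folder)
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, name), "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fields)
            writer.writeheader()
            writer.writerows(records)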
Questions:
1. Is this a good approach for processing the files?
2. When the files are larger (each file around 1 GB) we get an OOM (out-of-memory) error. Cluster config: driver i3.4xlarge (16 GB) and 4 worker nodes of the same size. How should we process the files in that case?
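For question 2, one direction I'm considering is streaming each file in fixed-size chunks instead of loading it whole, roughly like this (a sketch, assuming pandas and CSV input; the notna() check on a hypothetical "id" column stands in for the real field validation):

import pandas as pd

def process_file_chunked(file_path, chunksize=100_000):
    # Hold at most `chunksize` rows in memory per thread instead of the
    # whole ~1 GB file.
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        mask = chunk["id"].notna()  # placeholder validation
        good, bad = chunk[mask], chunk[~mask]
        # insert `good` into RDS and append both splits to the
        # COMPLETED/ERROR outputs here, as in the sketch above

Combined with capping the pool size (e.g. ThreadPool(4) instead of one thread per file), would that bound peak memory, or is there a better pattern for this cluster?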