Is it good to process files with multithreading?

Policepatil
New Contributor III

Hi,

I need to process nearly 30 files from different locations and insert their records into RDS.

I am using multithreading to process these files in parallel, as shown below.

 

from multiprocessing.pool import ThreadPool

def process_files(file_path):
    # <process file here>
    # 1. Find bad records based on field validation
    # 2. Find good records based on field validation
    # 3. Insert only the good records into RDS
    # 4. Write good records to a COMPLETED folder and bad records to an ERROR
    #    folder (both written to the same location as the original file)
    ...

pool = ThreadPool(len(files_list))
pool.map(process_files, files_list)
 
Questions:
1. Is this a good approach to processing these files?
2. If the files are large (each around 1 GB), we get an OOM (out-of-memory) error. Cluster config: driver i3.4xlarge (16 GB) and 4 worker nodes of the same size. How should we process the files in this case?
 

Kaniz_Fatma
Community Manager

Hi @Policepatil,

- Processing files in parallel can increase the overall speed of the operation.
- Multithreading can improve CPU utilization, but it does not necessarily make I/O faster, and I/O operations such as reading and writing files are often the bottleneck in tasks like this.
- Processing large files in one piece can lead to out-of-memory (OOM) errors. To avoid them, modify the process_files function to read each file in chunks instead of loading the entire file into memory (see the sketch below).
- Alternatively, use the COPY INTO command in Databricks to insert records from a file path into an existing table. COPY INTO skips files it has already loaded, so re-running it does not duplicate records; specify the file format and the source bucket path (see the example after the sketch).
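
A minimal sketch of the chunked approach with pandas, assuming the files are CSV; validate, insert_into_rds, completed_path, and error_path are hypothetical helpers standing in for your existing validation, RDS-insert, and output-path logic:

import pandas as pd

CHUNK_SIZE = 100_000  # rows per chunk; tune to the driver's available memory

def process_file_in_chunks(file_path):
    # Stream the file in fixed-size chunks so an entire 1 GB file
    # is never held in memory at once.
    for chunk in pd.read_csv(file_path, chunksize=CHUNK_SIZE):
        mask = chunk.apply(validate, axis=1)  # hypothetical row-level validator
        good, bad = chunk[mask], chunk[~mask]
        insert_into_rds(good)                 # hypothetical RDS insert helper
        # Append each chunk's output next to the original file
        good.to_csv(completed_path(file_path), mode="a", header=False, index=False)
        bad.to_csv(error_path(file_path), mode="a", header=False, index=False)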
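
And a sketch of the COPY INTO route, issued from Python via spark.sql; the table name and bucket path below are placeholders, not values from your setup:

spark.sql("""
    COPY INTO my_schema.target_table
    FROM 's3://my-bucket/input/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")

Because COPY INTO tracks which files it has already ingested, the same command can be re-run safely as new files arrive at the path.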
