Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Is it a good idea to process files with multithreading?

Policepatil
New Contributor III

Hi,

I need to process nearly 30 files from different locations and insert their records into RDS.

I am using multithreading to process these files in parallel, like below.

 

from multiprocessing.pool import ThreadPool

def process_files(file_path):
    # <process files here>
    # 1. Find bad records based on field validation
    # 2. Find good records based on field validation
    # 3. Insert only the good records into RDS
    # 4. Write good records to a COMPLETED folder and bad records to an
    #    ERROR folder (in the same location as the original file)
    ...

pool = ThreadPool(len(files_list))
pool.map(process_files, files_list)
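A minimal runnable sketch of this pattern, with placeholder file names and a stubbed `process_file` body (the fixed pool size of 4 is an arbitrary choice for illustration, not from the post; a bounded pool caps how many files are in flight at once instead of opening one thread per file):

```python
from multiprocessing.pool import ThreadPool

def process_file(file_path):
    # Placeholder: validate records, insert good ones into RDS,
    # write COMPLETED/ERROR outputs next to the original file.
    return f"done: {file_path}"

# Hypothetical input list standing in for the ~30 real files.
files_list = [f"file_{i}.csv" for i in range(30)]

# A bounded pool processes at most 4 files concurrently.
with ThreadPool(4) as pool:
    results = pool.map(process_file, files_list)

print(len(results))  # one result per input file
```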
 
Questions:
1. Is this a good approach to processing the files?
2. If the files are large (each around 1 GB) we get an OOM (out-of-memory) error. Cluster config: driver i3.4xlarge (16 GB) and 4 worker nodes of the same size. How should we process the files in that case?
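For question 2, one common way to bound memory is to validate records as a stream and write each record out immediately, rather than loading a whole 1 GB file at once. A minimal sketch, assuming CSV input and a hypothetical "no empty fields" validation rule (neither is stated in the post):

```python
import csv
import io

def process_stream(src, good_out, bad_out, is_valid):
    """Validate rows one at a time and write each immediately,
    so memory use stays flat regardless of file size."""
    good_writer = csv.writer(good_out)
    bad_writer = csv.writer(bad_out)
    n_good = n_bad = 0
    for row in csv.reader(src):
        if is_valid(row):
            good_writer.writerow(row)  # in the real job: also insert into RDS
            n_good += 1
        else:
            bad_writer.writerow(row)
            n_bad += 1
    return n_good, n_bad

# Hypothetical rule: a row is valid when no field is empty.
is_valid = lambda row: all(field.strip() for field in row)

# In-memory stand-ins for the source file and COMPLETED/ERROR outputs.
src = io.StringIO("1,alice\n2,\n3,carol\n")
good, bad = io.StringIO(), io.StringIO()
counts = process_stream(src, good, bad, is_valid)
print(counts)  # (2, 1)
```

The same idea applies per thread: each worker streams its file, so peak memory is one row (or one chunk) per thread rather than one whole file per thread.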
 
