Is it good to process files with multithreading?

Policepatil
New Contributor III

Hi,

I need to process nearly 30 files from different locations and insert their records into RDS.

I am using multithreading to process these files in parallel, as shown below.

 

from multiprocessing.pool import ThreadPool

def process_files(file_path):
    # <process file here>
    # 1. Find bad records based on field validation
    # 2. Find good records based on field validation
    # 3. Insert only the good records into RDS
    # 4. Write good records to a COMPLETED folder and bad records to an ERROR
    #    folder (both written to the same location as the original file)
    ...

pool = ThreadPool(len(files_list))
pool.map(process_files, files_list)
 
Questions:
1. Is this a good approach to processing these files?
2. If the files are large (each around 1 GB), we get an OOM (out-of-memory) error. Cluster config: driver i3.4xlarge (16 GB) and 4 worker nodes of the same size. How should we process the files in this case?
 

Kaniz_Fatma
Community Manager

Hi @Policepatil,

- Processing files in parallel can increase the overall speed of the operation.
- Multithreading can improve CPU utilization, but it does not necessarily make I/O faster, and I/O operations such as reading and writing files are often the bottleneck in tasks like this.
- Processing large files in one piece can lead to out-of-memory (OOM) errors. To avoid them, modify the process_files function to read each file in chunks instead of loading the entire file into memory (see the sketch below).
- Alternatively, use the COPY INTO command in Databricks to insert records from a file path into an existing table. COPY INTO skips files it has already loaded, so re-running it does not duplicate records; specify the file format and the source bucket path (see the example after the sketch).
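
A minimal sketch of the chunked approach with pandas, assuming the files are CSV; validate, insert_into_rds, completed_path, and error_path are hypothetical helpers standing in for your existing validation, RDS-insert, and output-path logic:

import pandas as pd

CHUNK_SIZE = 100_000  # rows per chunk; tune to the driver's available memory

def process_file_in_chunks(file_path):
    # Stream the file in fixed-size chunks so an entire 1 GB file
    # is never held in memory at once.
    for chunk in pd.read_csv(file_path, chunksize=CHUNK_SIZE):
        mask = chunk.apply(validate, axis=1)  # hypothetical row-level validator
        good, bad = chunk[mask], chunk[~mask]
        insert_into_rds(good)                 # hypothetical RDS insert helper
        # Append each chunk's output next to the original file
        good.to_csv(completed_path(file_path), mode="a", header=False, index=False)
        bad.to_csv(error_path(file_path), mode="a", header=False, index=False)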
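
And a sketch of the COPY INTO route, issued from Python via spark.sql; the table name and bucket path below are placeholders, not values from your setup:

spark.sql("""
    COPY INTO my_schema.target_table
    FROM 's3://my-bucket/input/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")

Because COPY INTO tracks which files it has already ingested, the same command can be re-run safely as new files arrive at the path.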
