Data Engineering

Best way to process a large number of records from multiple files

Policepatil
New Contributor III

Hi,

I have input files in S3 with the structure below.

/mnt/<mount_name>/test/<company_id>/sales/file_1.json

/mnt/<mount_name>/test/<company_id>/sales/file_2.json

/mnt/<mount_name>/test/<company_id>/sales/file_<n>.json

Number of companies = 15

Number of files per company = 30

Total files = 450

Each file contains nearly 180,000 records

My question is: what is the best way to read the files and insert the records into a database table?

After reading the files, I need to perform the operations below (a rough sketch follows the list).

1. Typecast the columns

2. Derive some columns from existing columns

3. Filter bad records

4. Join with the Item dataframe

5. Filter out records that do not match the item data

6. Insert into the DB table

7. Write error records to one file (error file) and completed records to another (completed file)
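
For concreteness, here is a rough PySpark sketch of steps 1-7; the column names (amount, sale_ts, item_id), the items table, and the output paths are placeholders for illustration, not the real names:

from pyspark.sql import functions as F

# read every company's sales files in one pass (path layout as above)
raw = spark.read.json("/mnt/<mount_name>/test/*/sales/*.json")

# 1. typecast and 2. derive columns (placeholder column names)
typed = (raw
         .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
         .withColumn("sale_date", F.to_date("sale_ts")))

# 3. split out bad records (placeholder validity rule)
good = typed.filter(F.col("item_id").isNotNull())
bad = typed.filter(F.col("item_id").isNull())

# 4. join with the item dataframe and 5. separate records with no matching item
items = spark.table("items")  # assumed item lookup table
matched = good.join(items, "item_id", "inner")
unmatched = good.join(items, "item_id", "left_anti")

# 6. insert into the target table (placeholder name)
matched.write.mode("append").saveAsTable("sales_processed")

# 7. write error and completed outputs (placeholder paths)
bad.unionByName(unmatched).write.mode("overwrite").json("/mnt/<mount_name>/output/error")
matched.write.mode("overwrite").json("/mnt/<mount_name>/output/completed")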

My Approach:

1. I read all the files using multithreading and write them to one location in Parquet format (if I write in Delta format it takes more time, and with multithreading the writes fail because the Delta table has to be created before writing to it). This takes nearly 30 minutes.

2. Once all the files have been written to that location in Parquet format, I read them back and start processing the records (one dataframe with nearly 81,000,000 records). This takes several hours. A rough sketch of this approach follows.
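
For reference, a minimal sketch of stage 1 as described: a thread pool that reads each JSON file and appends it to a single Parquet location. The staging path and the company_ids list are placeholders:

from concurrent.futures import ThreadPoolExecutor

base = "/mnt/<mount_name>/test"                      # mount name elided as above
staging = "/mnt/<mount_name>/staging/sales_parquet"  # placeholder staging location

def load_one(path):
    # each thread reads one JSON file and appends it as Parquet
    spark.read.json(path).write.mode("append").parquet(staging)

# 450 files: 15 companies x 30 files; company_ids is an assumed list of the 15 ids
paths = [f"{base}/{company_id}/sales/file_{i}.json"
         for company_id in company_ids
         for i in range(1, 31)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(load_one, paths))

# stage 2 then reads the combined Parquet output (~81,000,000 records)
sales_df = spark.read.parquet(staging)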

5 REPLIES

Policepatil
New Contributor III

Compute: Multinode cluster

Driver Type: i3.xlarge - 30.5GB, 4 cores

Worker Type: i3.xlarge - 30.5GB, 4 cores

Total number of workers:  4

lprevost
Contributor

Are the JSON files compressed? If they are in .gz, they are unsplittable, which means you lose some of Spark's parallel magic.

Policepatil
New Contributor III

No, the files are not compressed.

lprevost
Contributor

I'm on my learning curve too, but here are a few thoughts for you:

  1. Use Auto Loader to either read them all in one pass or read them by company folder in multiple passes; it doesn't look like much data. I'm doing something similar with CSV files where each partition has 2B records, and it takes about 15-20 minutes. Use glob patterns in the read statement to target the folders (see the first sketch after this list).

  2. Don't do too much transformation in your first stage. Use the medallion architecture to create a bronze table that just holds your raw data, plus the directory paths via _metadata. Put all of it in Databricks' default format, which is Delta. I don't think your lag is due to Delta vs. Parquet; I suspect it's due to either a lack of partitioning or big variations in file size that skew your jobs. You could read them, repartition, then write.

  3. Do a second stage to create your silver and gold tables for your items 2-5. That could include a partition or cluster column such as company, and it could also be where you do the join. I bet you'll get much better performance reading from your bronze Delta table for the item join than reading from the raw JSON (see the second sketch after this list).

  4. On item 2 of your approach: it shouldn't take that long. I wonder how you are doing the transforms and joins; that's a big topic and too broad to give guidance on here. But I see that as a subsequent step, not your first, which should be to create a single raw bronze table.
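
A minimal sketch of points 1-2, assuming Auto Loader over the mounted path with a glob pattern and a bronze Delta table (the schema/checkpoint locations and table name are placeholders, not something from your setup):

from pyspark.sql import functions as F

bronze_stream = (spark.readStream
    .format("cloudFiles")                             # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/<mount_name>/_schemas/sales_bronze")
    .load("/mnt/<mount_name>/test/*/sales/*.json")    # or one company folder per pass
    .withColumn("source_file", F.col("_metadata.file_path")))

(bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/<mount_name>/_checkpoints/sales_bronze")
    .trigger(availableNow=True)                       # process everything available, then stop
    .toTable("sales_bronze"))                         # managed Delta table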
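
And a sketch of point 3: read from the bronze Delta table, derive the company from the file path, repartition, and join to the item table. Column names, the join key, and table names are placeholders:

from pyspark.sql import functions as F

bronze = spark.table("sales_bronze")
items = spark.table("items")                          # assumed item lookup table

silver = (bronze
    # derive company_id from .../test/<company_id>/sales/file_<n>.json
    .withColumn("company_id", F.element_at(F.split("source_file", "/"), -3))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # example typecast
    .repartition("company_id"))                       # spread the work before the join

matched = silver.join(items, "item_id", "inner")      # placeholder join key
unmatched = silver.join(items, "item_id", "left_anti")

(matched.write.mode("append")
    .partitionBy("company_id")
    .saveAsTable("sales_silver"))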

Policepatil
New Contributor III

Thanks for your response, I will check and let you know.
