08-08-2024 04:52 AM - edited 08-08-2024 05:21 AM
Hi,
I have input files in S3 with the structure below:
/mnt/<mount_name>/test/<company_id>/sales/file_1.json
/mnt/<mount_name>/test/<company_id>/sales/file_2.json
/mnt/<mount_name>/test/<company_id>/sales/file_<n>.json
Number of companies = 15
Number of files per company = 30
Total files = 450
Each file contains nearly 180,000 records.
My question is: what is the best way to read these files and insert the records into a database table?
After reading the files, I need to perform the operations below:
1. Typecast the columns
2. Derive some columns from existing columns
3. Filter out bad records
4. Join with the Item DataFrame
5. Filter out records that do not match the item data
6. Insert into the DB table
7. Write error records to one file (error file) and completed records to another file (completed file)
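The per-record logic in steps 1–5 and 7 can be sketched in plain Python as below. The field names (`amount`, `qty`, `item_id`) and the item set are hypothetical stand-ins, not from the original post; in the real job each step would be a Spark DataFrame transformation applied across all 450 files.

```python
# Hypothetical set of known item IDs (stands in for the Item DataFrame).
items = {"A1", "B2"}

records = [
    {"amount": "10.5", "qty": 2, "item_id": "A1"},
    {"amount": None,   "qty": 1, "item_id": "B2"},   # bad record: missing amount
    {"amount": "3.0",  "qty": 4, "item_id": "Z9"},   # item not in item data
]

completed, errors = [], []
for rec in records:
    # 3. Filter out bad records (here: missing amount).
    if rec["amount"] is None:
        errors.append(rec)
        continue
    # 1. Typecast the amount column from string to float.
    rec = {**rec, "amount": float(rec["amount"])}
    # 2. Derive a new column from existing ones.
    rec["total"] = rec["amount"] * rec["qty"]
    # 4./5. Join against item data and drop non-matching records.
    if rec["item_id"] not in items:
        errors.append(rec)
        continue
    completed.append(rec)  # 6. these rows would be inserted into the DB

# 7. `errors` would go to the error file, `completed` to the completed file.
print(len(completed), len(errors))  # prints "1 2"
```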
My Approach:
1. I read all the files with multithreading and write them to one location in Parquet format (writing in Delta format takes longer, and it also fails under multithreading because the Delta table must be created before it can be written to) - this takes nearly 30 minutes.
2. Once all the files are written to that location in Parquet format, I read them back and start processing the records (one DataFrame with nearly 81,000,000 records) - this takes several hours.
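The multithreaded fan-out in step 1 can be sketched as below. `process_file` is a stand-in for the real read-JSON-and-write-Parquet work, and the company IDs and mount name are placeholders invented for illustration; only the file counts (15 companies × 30 files = 450) come from the post.

```python
from concurrent.futures import ThreadPoolExecutor

company_ids = [f"company_{i}" for i in range(1, 16)]  # 15 companies (placeholder IDs)
files_per_company = 30

# Build the full list of input paths (mount name is a placeholder).
paths = [
    f"/mnt/mount_name/test/{cid}/sales/file_{n}.json"
    for cid in company_ids
    for n in range(1, files_per_company + 1)
]

def process_file(path):
    # In the real job: read the JSON file and append it to the Parquet location.
    return path

# Fan the 450 files out over a thread pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_file, paths))

print(len(results))  # prints "450"
```

Note that Spark can also read all of these files in a single call with a glob path, which usually parallelizes better than driver-side threads.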
08-08-2024 05:28 AM
Compute: Multinode cluster
Driver Type: i3.xlarge - 30.5GB, 4 cores
Worker Type: i3.xlarge - 30.5GB, 4 cores
Total number of workers: 4
08-08-2024 07:24 AM
Are the JSON files compressed? If they are in .gz, they are unsplittable, which means you lose some of Spark's parallel magic.
08-09-2024 05:29 AM
No, the files are not compressed.
08-09-2024 06:19 AM
I’m on a learning curve too, but here are a few thoughts for you:
08-20-2024 10:35 PM
Thanks for your response, I will check and let you know.