Data Engineering

Best way to process a large number of records from multiple files

Policepatil
New Contributor III

Hi,

I have input files in S3 with the structure below.

/mnt/<mount_name>/test/<company_id>/sales/file_1.json

/mnt/<mount_name>/test/<company_id>/sales/file_2.json

/mnt/<mount_name>/test/<company_id>/sales/file_<n>.json

Number of companies = 15

Number of files per company = 30

Total files = 450

Each file contains nearly 180,000 records

My question is: what is the best way to read the files and insert the records into a database table?

After reading the files, I need to perform the operations below (a rough sketch follows the list).

1. Typecast the columns

2. Derive some columns from existing columns

3. Filter bad records

4. Join with the Item dataframe

5. Filter out records that do not match the item data

6. Insert into the DB table

7. Write error records to one file (error file) and completed records to another (completed file)
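
For concreteness, here is a rough PySpark sketch of steps 1-7; the column names (amount, sale_ts, item_id), the items table, and the output paths are placeholders for illustration, not the real names:

from pyspark.sql import functions as F

# read every company's sales files in one pass (path layout as above)
raw = spark.read.json("/mnt/<mount_name>/test/*/sales/*.json")

# 1. typecast and 2. derive columns (placeholder column names)
typed = (raw
         .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
         .withColumn("sale_date", F.to_date("sale_ts")))

# 3. split out bad records (placeholder validity rule)
good = typed.filter(F.col("item_id").isNotNull())
bad = typed.filter(F.col("item_id").isNull())

# 4. join with the item dataframe and 5. separate records with no matching item
items = spark.table("items")  # assumed item lookup table
matched = good.join(items, "item_id", "inner")
unmatched = good.join(items, "item_id", "left_anti")

# 6. insert into the target table (placeholder name)
matched.write.mode("append").saveAsTable("sales_processed")

# 7. write error and completed outputs (placeholder paths)
bad.unionByName(unmatched).write.mode("overwrite").json("/mnt/<mount_name>/output/error")
matched.write.mode("overwrite").json("/mnt/<mount_name>/output/completed")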

My Approach:

1. I read all the files using multithreading and write them to one location in Parquet format (if I write in Delta format it takes more time, and with multithreading the writes fail because the Delta table has to be created before writing to it). This takes nearly 30 minutes.

2. Once all the files have been written to that location in Parquet format, I read them back and start processing the records (one dataframe with nearly 81,000,000 records). This takes several hours. A rough sketch of this approach follows.
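
For reference, a minimal sketch of stage 1 as described: a thread pool that reads each JSON file and appends it to a single Parquet location. The staging path and the company_ids list are placeholders:

from concurrent.futures import ThreadPoolExecutor

base = "/mnt/<mount_name>/test"                      # mount name elided as above
staging = "/mnt/<mount_name>/staging/sales_parquet"  # placeholder staging location

def load_one(path):
    # each thread reads one JSON file and appends it as Parquet
    spark.read.json(path).write.mode("append").parquet(staging)

# 450 files: 15 companies x 30 files; company_ids is an assumed list of the 15 ids
paths = [f"{base}/{company_id}/sales/file_{i}.json"
         for company_id in company_ids
         for i in range(1, 31)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(load_one, paths))

# stage 2 then reads the combined Parquet output (~81,000,000 records)
sales_df = spark.read.parquet(staging)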

5 REPLIES

Policepatil
New Contributor III

Compute: Multinode cluster

Driver Type: i3.xlarge - 30.5GB, 4 cores

Worker Type: i3.xlarge - 30.5GB, 4 cores

Total number of workers:  4

lprevost
Contributor

Are the JSON files compressed? If they are in .gz, they are unsplittable, which means you lose some of Spark's parallel magic.

Policepatil
New Contributor III

No, the files are not compressed.

lprevost
Contributor

I'm on my learning curve too, but here are a few thoughts for you:

  1. Use Auto Loader to either read them all in one pass or read them by company folder in multiple passes; it doesn't look like much data. I'm doing something similar with CSV files where each partition has 2B records, and it takes about 15-20 minutes. Use glob patterns in the read statement to target the folders (see the first sketch after this list).

  2. Don't do too much transformation in your first stage. Use the medallion architecture to create a bronze table that just holds your raw data, plus the directory paths via _metadata. Put all of it in Databricks' default format, which is Delta. I don't think your lag is due to Delta vs. Parquet; I suspect it's due to either a lack of partitioning or big variations in file size that skew your jobs. You could read them, repartition, then write.

  3. Do a second stage to create your silver and gold tables for your items 2-5. That could include a partition or cluster column such as company, and it could also be where you do the join. I bet you'll get much better performance reading from your bronze Delta table for the item join than reading from the raw JSON (see the second sketch after this list).

  4. On item 2 of your approach: it shouldn't take that long. I wonder how you are doing the transforms and joins; that's a big topic and too broad to give guidance on here. But I see that as a subsequent step, not your first, which should be to create a single raw bronze table.
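
A minimal sketch of points 1-2, assuming Auto Loader over the mounted path with a glob pattern and a bronze Delta table (the schema/checkpoint locations and table name are placeholders, not something from your setup):

from pyspark.sql import functions as F

bronze_stream = (spark.readStream
    .format("cloudFiles")                             # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/<mount_name>/_schemas/sales_bronze")
    .load("/mnt/<mount_name>/test/*/sales/*.json")    # or one company folder per pass
    .withColumn("source_file", F.col("_metadata.file_path")))

(bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/<mount_name>/_checkpoints/sales_bronze")
    .trigger(availableNow=True)                       # process everything available, then stop
    .toTable("sales_bronze"))                         # managed Delta table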
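
And a sketch of point 3: read from the bronze Delta table, derive the company from the file path, repartition, and join to the item table. Column names, the join key, and table names are placeholders:

from pyspark.sql import functions as F

bronze = spark.table("sales_bronze")
items = spark.table("items")                          # assumed item lookup table

silver = (bronze
    # derive company_id from .../test/<company_id>/sales/file_<n>.json
    .withColumn("company_id", F.element_at(F.split("source_file", "/"), -3))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # example typecast
    .repartition("company_id"))                       # spread the work before the join

matched = silver.join(items, "item_id", "inner")      # placeholder join key
unmatched = silver.join(items, "item_id", "left_anti")

(matched.write.mode("append")
    .partitionBy("company_id")
    .saveAsTable("sales_silver"))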

Policepatil
New Contributor III

Thanks for your response, I will check and let you know.
