Records are missing when creating new dataframes from one large dataframe using filter

Policepatil
New Contributor III

Hi,

I have data in a file like the one below:

Policepatil_1-1693806659492.png

Policepatil_3-1693806860560.png

 

My input file contains different types of rows; column number 8 defines the record type.

The file above has 4 record types, 00 to 03.

My requirement is:

  • There are multiple files in the source path, each with nearly 1 million records.
  • Read the files and create a separate dataframe per record type by filtering the original dataframe (the one containing all record types).
  • Based on a mapping file, select the column positions and map them to column names.
  • Create a dictionary of dataframes with the record type as the key and the dataframe as the value.
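
The steps above can be sketched in plain Python to show the intended logic. This is not the actual PySpark code (that is only in the screenshot). It is a minimal illustration that assumes a comma-delimited file, takes "column number 8" as 1-based (index 7), and uses a made-up mapping; `DELIMITER`, `RECORD_TYPE_INDEX`, `COLUMN_MAPPING` and `split_by_record_type` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical values: the real delimiter and mapping come from the
# input files and the mapping file, which are not shown in the post.
DELIMITER = ","
RECORD_TYPE_INDEX = 7  # "column number 8", assuming 1-based counting

# Hypothetical mapping: record type -> {column position: column name}
COLUMN_MAPPING = {
    "00": {0: "id", 7: "record_type", 8: "file_date"},
    "01": {0: "id", 7: "record_type", 8: "amount"},
}


def split_by_record_type(lines, mapping):
    """Group rows by record type, keeping only the mapped columns."""
    frames = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(DELIMITER)
        record_type = fields[RECORD_TYPE_INDEX]
        positions = mapping.get(record_type)
        if positions is None:
            continue  # record type has no entry in the mapping
        frames[record_type].append(
            {name: fields[pos] for pos, name in positions.items()}
        )
    return dict(frames)


sample_lines = [
    "1836,a,b,c,d,e,f,00,2023-09-04",
    "1836,a,b,c,d,e,f,01,10.50",
    "1836,a,b,c,d,e,f,01,20.00",
]
frames_by_type = split_by_record_type(sample_lines, COLUMN_MAPPING)
```

The result is a dictionary keyed by record type, and the per-type row counts should add up to the number of input rows whenever every record type appears in the mapping.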

My code looks like this:

Policepatil_4-1693807544901.png

The issue is that some records are missing from the result dataframes.

Example: 

  • For id 1836, record type 01, there should be 15 records but we get only 14. If I rerun the job, the same issue appears in a different file for a different id.
  • In the original dataframe there are 18 rows in total for id 1836; 15 of those 18 have record type 01.

 Policepatil_5-1693807898507.png

  • The dataframe below is the result after filtering on record type, and one record is missing from it. There should be 15 rows but there are only 14.

Policepatil_6-1693808066454.png

Why are records missing after filtering?
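
The comparison described above (expected rows per record type in the original data versus rows present after filtering) can be sketched as follows. This is plain Python on made-up rows, not the real job. The counts mirror the example (18 rows for id 1836, 15 of type 01), the split of the remaining 3 rows is an assumption, and `count_by_type` and `find_missing` are hypothetical helper names.

```python
from collections import Counter


def count_by_type(rows, record_id):
    """Count rows per record type for one id."""
    return Counter(row["record_type"] for row in rows if row["id"] == record_id)


def find_missing(original_rows, frames_by_type, record_id):
    """Return {record_type: number of rows missing} for one id."""
    expected = count_by_type(original_rows, record_id)
    missing = {}
    for record_type, expected_count in expected.items():
        filtered_rows = frames_by_type.get(record_type, [])
        actual_count = sum(1 for row in filtered_rows if row["id"] == record_id)
        if actual_count != expected_count:
            missing[record_type] = expected_count - actual_count
    return missing


# Made-up rows: 18 rows for id 1836, 15 of them with record type 01.
# Assigning the other 3 rows to type 00 is an assumption.
original = [{"id": "1836", "record_type": "01"} for _ in range(15)]
original += [{"id": "1836", "record_type": "00"} for _ in range(3)]

# Simulate the filtered result with one type-01 row dropped.
filtered = {
    "01": original[:14],
    "00": original[15:],
}

missing = find_missing(original, filtered, "1836")
print(missing)  # {'01': 1}
```

Running a check like this per file after each run would show which file and record type lost rows, which is useful because the affected file changes between runs.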

Hi @Retired_mod ,

Thanks for your reply.

There is no issue with the data. As you can see at line 20 in my code, all_trans_df is created after reading the data from the file and is then passed to this function.

We can see the data in the all_trans_df dataframe but not in the result dataframe.

Note: I have nearly 30 files and process them in parallel using multithreading.
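
For reference, a minimal sketch of one way to structure such a parallel run, assuming one task per file. It is not the code from the screenshot: `FILES` and `process_file` are hypothetical, and the files are simulated in memory. Each worker builds its result from local variables only and returns it, so no dictionary or list is written to by more than one thread. Shared mutable state between threads is one possible source of results that differ from run to run, though nothing here confirms that it is the cause in this case.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the ~30 source files.
# Each line is "<id>,<record type>".
FILES = {f"file{i}": [f"{i},01", f"{i},00", f"{i},01"] for i in range(1, 31)}


def process_file(name):
    """Split one file by record type using only local variables."""
    frames = {}
    for line in FILES[name]:
        record_id, record_type = line.split(",")
        frames.setdefault(record_type, []).append(record_id)
    return name, frames


# Each task returns its own (name, frames) pair; results are combined
# in the main thread after the workers finish.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(process_file, FILES))
```

With this layout, `results` maps each file name to its own dictionary of record types, so a per-file count check can be run on every entry independently.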

Policepatil
New Contributor III

Hi @Retired_mod ,

If I run again with the same files, records are sometimes missing from the same files as in the previous run and sometimes from different files.

Example:

run1: 1 record missing in file1, no issue with other files

run2: 1 record missing in file3 and file4, no issue with other files