Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Records are missing while creating new data from one big dataframe using filter

Policepatil
New Contributor III

Hi,

I have data in file like below

[Screenshots: sample rows from the input file]

I have different types of rows in my input file; column number 8 defines the type of the record.

In the file above there are 4 record types, 00 to 03.

My requirement is:

  • There will be multiple files in the source path, each with nearly 1 million records
  • Read the files and create separate dataframes per record type by filtering the original dataframe (which contains all record types)
  • Based on a mapping file, select the column positions and map them to column names
  • Create a dictionary of dataframes with the record type as key and the dataframe as value

My code looks like below

[Screenshot: the code that reads the files and builds the per-record-type dataframes]

The issue is that some records are missing from the result dataframes.

Example: 

  • For id 1836, record type 01, there should be 15 records, but we get only 14. On a re-run, the same issue appears in another file for another id.
  • In the original dataframe there are 18 rows for id 1836; of those 18, 15 belong to record type 01.

[Screenshot: the 18 rows for id 1836 in the original dataframe]

  • The dataframe below is the result after filtering on record type; one record is missing. There should be 15, but we have only 14.

[Screenshot: the filtered dataframe for record type 01, showing 14 rows instead of 15]

Why are records missing when filtering?

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @Policepatil, there could be multiple reasons why records are missing after filtering.

1. Incorrect Filtering Criteria: Check the filtering criteria used in your code. If the requirements are not correctly defined, it could exclude some records unintentionally.

2. Corrupt or Incomplete Records: If your input file has corrupt or incomplete records, they might be excluded during data processing. You can handle such records using options like badRecordsPath, or the PERMISSIVE, DROPMALFORMED, and FAILFAST modes during data loading. Refer to [this documentation](https://docs.databricks.com/ingestion/bad-records.html) for more details.

3. Mismatched Data Types: If the data type of the values in your records does not match the data type defined in your schema, those records could be nullified or excluded.

4. Missing Values: Those records could be excluded if the filtering is based on a specific column with missing values. You can handle missing values using methods like fillna() or dropna().

Refer to [this documentation](https://docs.databricks.com/notebooks/bamboolib.html) for more details.

Remember to carefully review your code and the data in your input files.

If necessary, perform data cleaning and preprocessing steps before filtering the records.

Sources:
- [Docs: bad-records](https://docs.databricks.com/ingestion/bad-records.html)
- [Docs: bamboolib](https://docs.databricks.com/notebooks/bamboolib.html)

Hi @Kaniz_Fatma ,

Thanks for your reply.

There is no issue with the data. You can see line number 20 in my code: I have all_trans_df, which is created after reading the data from the file, and I send it to this function.

We can see the data in the all_trans_df dataframe but not in the result dataframe.

Note: I have nearly 30 files and run them in parallel using multithreading.

Policepatil
New Contributor III

Hi @Kaniz_Fatma ,

If I run again with the same files, sometimes records are missing from the same files as in the previous run, and sometimes from different files.

Example:

Run 1: 1 record missing in file1, no issue with the other files

Run 2: 1 record missing in file3 and file4, no issue with the other files
