Hi @Policepatil, there could be multiple reasons why records are missing after filtering:
1. Incorrect Filtering Criteria: Check the filtering criteria used in your code. If the conditions are not defined correctly, they could exclude some records unintentionally.
2. Corrupt or Incomplete Records: If your input file has corrupt or incomplete records, they might be excluded during data processing. You can handle such records using options like `badRecordsPath` or the `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST` modes during data loading. Refer to [this documentation](https://docs.databricks.com/ingestion/bad-records.html) for more details.
3. Mismatched Data Types: If the values in your records do not match the data types defined in your schema, those values could be set to null or the records could be excluded, depending on the read mode.
4. Missing Values: Those records could be excluded if the filtering is based on a specific column with missing values. You can handle missing values using methods like `fillna()` or `dropna()`.
Refer to [this documentation](https://docs.databricks.com/notebooks/bamboolib.html) for more details.
Remember to carefully review your code and the data in your input files.
If necessary, perform data cleaning and preprocessing steps before filtering the records.
Sources:
- [Docs: bad-records](https://docs.databricks.com/ingestion/bad-records.html)
- [Docs: bamboolib](https://docs.databricks.com/notebooks/bamboolib.html)