Hi @Policepatil, missing records like this usually trace back to one of two things: how the job is parallelized in Spark, or data quality issues in the input files.
One possible explanation is the multithreading itself. Apache Spark's SparkContext is thread-safe and already parallelizes work through its own scheduler, but only one SparkContext should be active per JVM. If each of your threads builds its own session, or if several threads write to the same output path concurrently, records can be lost or overwritten without any error being raised.
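As a sanity check, a safer pattern is to share a single SparkSession across all threads and let Spark schedule the concurrent jobs. A minimal sketch of that pattern (the paths, thread count, and the `load_and_count` helper are hypothetical, for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

# One SparkSession for the whole application; Spark's scheduler
# handles job submissions arriving from multiple threads.
spark = SparkSession.builder.appName("parallel-loads").getOrCreate()

def load_and_count(path):
    # Each thread reuses the shared session instead of creating its own.
    return path, spark.read.parquet(path).count()

# Hypothetical input paths.
paths = ["/input/day1", "/input/day2", "/input/day3"]

with ThreadPoolExecutor(max_workers=3) as pool:
    for path, n in pool.map(load_and_count, paths):
        print(f"{path}: {n} records")
```

Comparing these per-path counts against what you expect is a quick way to see whether records go missing at read time or later in the pipeline.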
Another possible reason is bad or corrupted records in your input files. Spark offers several options for dealing with these; on Databricks, for instance, you can use the badRecordsPath option to specify a path where exceptions for bad records or unreadable files are recorded during loading.
Here is a code snippet to handle bad records:
```python
# badRecordsPath is Databricks-specific: unreadable files and bad
# records are logged as JSON exception files under this path.
df = spark.read \
    .option("badRecordsPath", "/tmp/badRecordsPath") \
    .format("parquet") \
    .load("/input/parquetFile")
```
In the example above, Spark writes an exception file in JSON format under the bad-records path whenever it cannot find an input file or encounters bad records inside one. Without more detail about your code and data it's hard to give a more specific answer, but I'd start by checking the input files for quality issues and reviewing how your threads share the Spark session.
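If you're on open-source Spark rather than Databricks, badRecordsPath isn't available, but for text-based sources such as JSON or CSV you can surface malformed rows with PERMISSIVE mode instead. A rough sketch, where the schema and input path are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Assumed schema; declaring _corrupt_record lets Spark capture bad rows.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = spark.read \
    .schema(schema) \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .json("/input/jsonFile")  # hypothetical path

# Cache first: Spark disallows queries on raw JSON/CSV that reference
# only the corrupt-record column unless the parsed result is cached.
df.cache()
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
```

Any rows that show up with a non-null _corrupt_record are the ones Spark could not parse, which would explain records silently going missing from your counts.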