Hello everyone,
I'm facing an issue when writing data in CSV format to Azure Data Lake Storage (ADLS). Before writing, there are no duplicates in the DataFrame, and all the records look correct. However, after writing the CSV files to ADLS, I notice duplicates in the node_id:ID(MatlBatchPlant){label:MatlBatchPlant} column.
Here's a summary of my observations:
- The DataFrame row count before writing matches the CSV row count after writing, yet I still find duplicate rows in the CSV.
- Some records are missing from the CSV despite the DataFrame and CSV file having the same number of rows.
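To make the check concrete, this is the kind of before/after comparison I'm running, shown here with plain Python and the csv module for brevity rather than my actual Spark job (the column names and in-memory buffer are simplified stand-ins; the real id column is node_id:ID(MatlBatchPlant) and the real target is ADLS):

```python
import csv
import io
from collections import Counter

# Stand-in rows; "node_id" plays the role of the real id column.
rows = [
    {"node_id": "1", "label": "MatlBatchPlant"},
    {"node_id": "2", "label": "MatlBatchPlant"},
    {"node_id": "3", "label": "MatlBatchPlant"},
]

# No duplicate ids before writing.
before = Counter(r["node_id"] for r in rows)
assert all(count == 1 for count in before.values())

# Write to CSV and read it back (in-memory here; ADLS in the real job).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["node_id", "label"])
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
rows_back = list(csv.DictReader(buf))

# Row counts match, and for this clean sample the ids round-trip exactly.
# In my real data the counts also match, yet duplicate ids show up.
assert len(rows_back) == len(rows)
after = Counter(r["node_id"] for r in rows_back)
assert after == before
```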
Debugging Steps I've Tried:
- Writing to Parquet format – No duplicates found in the Parquet file; the count matches as expected.
- Writing using a single partition – The problem persists.
- Loading the DataFrame into a Postgres database – No duplicates in the node_id column in Postgres.
- Reading the Parquet file and converting it to CSV – The duplicate issue reappears in the CSV file.
It seems the issue occurs only when writing to CSV; I haven't encountered the same behavior with the other targets (Parquet or Postgres).
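One hypothesis I'm still checking (an assumption about my data, not something I've confirmed): if any value in the id column contains an embedded newline or delimiter, the CSV writer will quote it correctly, but anything that later processes the file line by line (or a Spark CSV read without multiline handling) will misalign the rows, so some values disappear and fragments of others can look like duplicates. A minimal stand-alone illustration with Python's csv module:

```python
import csv
import io

# One id contains an embedded newline.
rows = [("1", "a"), ("2\n3", "b"), ("4", "c")]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A quote-aware CSV reader recovers the original 3 rows...
assert len(list(csv.reader(io.StringIO(text)))) == 3

# ...but naive line-based parsing sees 4 records, with the multiline
# id split across two of them, so fields end up misaligned.
naive = [line.split(",") for line in text.strip().splitlines()]
assert len(naive) == 4
```

If this turns out to be the cause, it would explain why Parquet and Postgres are unaffected: both carry values in a typed, self-describing form instead of delimiter-separated text.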
Any insights or suggestions on how to address this issue would be greatly appreciated. Thank you!