
Pyspark CSV Incorrect Count

Tarique — Tue, 27 Sep 2022 13:58:01 GMT

B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z

df_inputfile = (spark.read.format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "false")
                .option("quotedstring", '\"')
                .option("escape", '\"')
                .option("multiline", "true")
                .option("delimiter", ",")
                .load('<path to csv>'))
 
print(df_inputfile.count()) # Prints 3
print(df_inputfile.distinct().count()) # Prints 4

I'm trying to read the data above from a CSV file and I get the wrong row count, even though the dataframe contains all the expected records. df_inputfile.count() prints 3, but it should be 4, and df_inputfile.distinct().count() does print 4.

It looks like this is happening because of the single comma in the 4th column of the 3rd row. Can someone please explain why?
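One way to narrow this down is to check the sample outside Spark. The sketch below parses the same four rows with Python's standard csv module. This is not Spark's parser, so it says nothing definitive about what Spark does internally; it only shows how the data reads under ordinary doubled-quote CSV rules. SAMPLE and parse_rows are illustrative names, not part of the original code.

```python
import csv
import io

# The four sample rows from the question, copied verbatim.
SAMPLE = '''B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
'''


def parse_rows(text):
    """Parse CSV text with the csv module's defaults (quote char ", doubled-quote escaping)."""
    return list(csv.reader(io.StringIO(text)))


rows = parse_rows(SAMPLE)

print(len(rows))                   # number of records: 4
print({len(r) for r in rows})      # distinct field counts per record: {7}
print(repr(rows[2][3]))            # 4th column of the 3rd row: ','
```

Under these rules the file has 4 records of 7 fields each, and the lone comma in the 3rd row is an ordinary quoted field. So the data itself looks well-formed, which points at how the reader is configured or how count() is evaluated rather than at the file.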

Re: Pyspark CSV Incorrect Count

Tarique — Tue, 04 Oct 2022 07:57:20 GMT

Hi Debayan, there's no syntax error in the code snippet. Using .option("escape",'"') makes no difference to the counts. I still get wrong counts.
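That the two spellings behave the same is expected at the Python level: '\"' and '"' are the same one-character string, so Spark receives an identical option value either way. A quick check, independent of Spark (the variable names are just for illustration):

```python
# Both literals denote the same one-character string: a double quote.
escaped = '\"'
plain = '"'

print(escaped == plain)  # True
print(len(escaped))      # 1
```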

Re: Pyspark CSV Incorrect Count

Tarique — Tue, 04 Oct 2022 08:00:00 GMT

Hi @Kaniz Fatma, unfortunately the suggestion hasn't helped, and I haven't been able to figure out the reason for the strange results so far.

Re: Pyspark CSV Incorrect Count

Anonymous — Mon, 21 Nov 2022 03:43:24 GMT

Hi @Tarique Anwer

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Re: Pyspark CSV Incorrect Count

Debayan — Fri, 30 Sep 2022 06:23:05 GMT

Hi, could you please check the syntax of '\"'?