Data Engineering

Pyspark CSV Incorrect Count

TariqueAnwer
New Contributor II
B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
df_inputfile = (spark.read.format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "false")
                # "quote" is the documented option name; "quotedstring" is not recognized
                .option("quote", '\"')
                .option("escape", '\"')
                .option("multiline", "true")
                .option("delimiter", ",")
                .load('<path to csv>'))

print(df_inputfile.count())             # Prints 3
print(df_inputfile.distinct().count())  # Prints 4

I'm trying to read the data above from a CSV file and I get the wrong row count, even though the DataFrame contains all the expected records. df_inputfile.count() prints 3, although it should be 4; df_inputfile.distinct().count() does print 4.

It looks like this is happening because of the single comma in the 4th column of the 3rd row. Can someone please explain why?
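
For reference, the expected record count can be checked outside Spark. The sketch below parses the four sample rows with Python's standard csv module, using a double quote as the quote character and a doubled quote as the escape. It assumes the file contains exactly the four lines shown above; SAMPLE and parse_rows are hypothetical names used only for illustration. Python's csv module is a different parser from the one Spark uses, so this only confirms what the data should yield, not why count() disagrees.

```python
import csv
import io

# The four sample rows from the post, copied verbatim (assumed to be the whole file).
SAMPLE = '''B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
'''


def parse_rows(text):
    """Parse CSV text where '"' quotes fields and '""' is an escaped quote."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar='"',
        doublequote=True,
    )
    return list(reader)


rows = parse_rows(SAMPLE)
print(len(rows))                  # 4 records
print({len(r) for r in rows})     # {7}: every record has 7 fields
print(repr(rows[2][3]))           # ',': the lone comma is a quoted field value
```

Parsed this way, the quoted comma in the 3rd row is an ordinary field value and does not merge or split any records, which matches the expectation of 4 rows.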

5 REPLIES

Debayan
Esteemed Contributor III

Hi, could you please check the syntax of the escape option, '\"'?

TariqueAnwer
New Contributor II

Hi Debayan, there's no syntax error in the code snippet. Using .option("escape",'"') instead makes no difference: I still get the same wrong counts.

Kaniz
Community Manager

Hi @Tarique Anwer, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if his suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

TariqueAnwer
New Contributor II

Hi @Kaniz Fatma, unfortunately the suggestion hasn't helped, and so far I haven't been able to figure out the reason for the strange results.

Anonymous
Not applicable

Hi @Tarique Anwer

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
