Data shifted when a pyspark dataframe column only contains a comma
10-21-2024 01:58 AM
I have a dataframe with several columns, one of which contains, for one specific record, just a comma and nothing else.
When displaying the dataframe with the command
10-21-2024 02:03 AM
Here is a screenshot of my code and the output:
10-21-2024 03:11 AM - edited 10-21-2024 03:12 AM
Hi @fabien_arnaud,
I have tried to reproduce the issue using DBR 12.2 and in my case everything works as expected:
Could you share how this dataframe is created? Are you perhaps reading it from a CSV file?
Also, could you create a new dataframe:
from pyspark.sql.functions import col
df_filtered = df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"), col("state_cd"), col("state_nm"), col("country_cd"), col("country_nm"))
And then run:
df_filtered.printSchema()
df_filtered.show()
Let's check whether the problem is in the dataframe itself or whether the display() function renders the dataframe incorrectly because of the standalone comma.
10-21-2024 04:25 AM
Yes, the dataframe is read from a CSV file. Here is the code:
10-21-2024 05:05 AM
Hi @fabien_arnaud ,
I think I know the issue.
Could you please change your escape character (escape = '"') so that it differs from your quote character (quote = '"')?
For example, set it to \.
Your CSV contains a sequence like ","," and one of the quotes is probably being treated as an escape character rather than as the end of the quoted field.
Let us know if that helps
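The difference between "quote doubles as its own escape" and "backslash is the escape" can be illustrated outside Spark. The sketch below uses Python's standard csv module, not Spark's CSV reader, so it only shows the general convention and says nothing about how a particular Databricks runtime behaves. The sample row, the vendor name and the `parse` helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample row: the name field contains double quotes escaped by
# doubling them, and the last field holds nothing but a comma.
SAMPLE = 'id,name,state_cd\n1,"ACME ""West"" Ltd",","\n'


def parse(text, **fmt):
    """Parse CSV text with Python's csv module and return all rows."""
    return list(csv.reader(io.StringIO(text), **fmt))


# Quote character also acts as the escape for embedded quotes.
doubled = parse(SAMPLE, quotechar='"', doublequote=True)

# Backslash as the escape character: a doubled quote is no longer
# recognised as an escaped quote.
backslash = parse(SAMPLE, quotechar='"', doublequote=False, escapechar="\\")

print(doubled[1])    # comma-only field and embedded quotes both survive
print(backslash[1])  # the name field is no longer parsed as intended
```

In this sketch the doubled-quote setting yields three fields with the comma-only value intact, while switching the escape to a backslash garbles the field with embedded quotes. That is the trade-off to weigh before changing the escape character for a file that relies on doubled quotes.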
10-21-2024 06:28 AM
I actually can't change the escape character: the source file uses the double quote for escaping, and it is required to parse other columns in the dataframe correctly, such as the case below, where the name column contains double quotes in the data value:
As mentioned earlier, though, the file is read perfectly with Databricks Runtime 15.4 LTS, so that will probably have to be the way forward. I hadn't upgraded yet because I had issues installing the various dependencies with the new Ubuntu version used by that runtime, but I managed in the end.
I really appreciate the time you spent trying to help me out and your suggestions, Filip!
11-08-2024 05:21 AM
Thank you so much for the solution.

