cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud
New Contributor II

I have a dataframe containing several columns among which 1 contains, for one specific record, just a comma, nothing else.

When displaying the dataframe with the command

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838'))
 
The data is displayed correctly for all of my columns
 
However, when I select specific columns from the same dataframe, i.e.
 
display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm")))
 
all of my data from columns to the right of the one that only contains the comma gets shifted to the left. The comma seems to be identified as a column separator during the "select" although everything is correctly loaded in my dataframe.
 How can I avoid this behavior?
 
I use databricks runtime 12.2LTS and my notebook is in python.
6 REPLIES 6

fabien_arnaud
New Contributor II

Here is a screenshot of my code and the output: 

fabien_arnaud_0-1729501314813.png

 

filipniziol
Contributor III

Hi @fabien_arnaud

I have tried to reproduce the issue using DBR 12.2 and in my case everything works as expected:

filipniziol_0-1729505239655.png

Could you share how this dataframe is created? Are you reading some csv file maybe?
Also, could you assign create a new dataframe:

 

df_filtered = df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm"))

 

And then run:

df_filtered.printSchema()

df_filtered.show()

Let's check whether it is a problem with the dataframe or maybe display() function renders the dataframe incorrectly due to standalone comma.

 

 

fabien_arnaud
New Contributor II

Yes the dataframe reads from a CSV. Here is the code:

 

df_input = (spark
                    .read
                    .format('CSV')
                    .options(header= True,
                            delimiter = ",",
                            quote = '"',
                            escape = '"',
                            inferSchema = 'false',
                            encoding = 'UTF8',
                            multiline = True,
                            rootTag = '',
                            rowTag = '',
                            attributePrefix = ''
                            )
                    .load("dbfs:/mnt/bdwuploaddevfabien-mdm/mdm_vendor_master_2024-09-10.csv")
                  )
 
Here is the screenshot of a subsequent filtered dataframe as suggested. The problem persists:  
 
fabien_arnaud_0-1729509830683.png

 

 
By the way, I tested the code with runtimes 13.3LTS, 14.3LTS and 15.4LTS as well, and the issue occurs with all except 15.4LTS.
 
 

filipniziol
Contributor III

Hi @fabien_arnaud ,

I think I know the issue.

Could you please change your escape character (escape = '"') to be different than your quote character (quote = '"')?
For example set it to \.

In your csv there is a sequence like ","," and one of the quotes is used to escape comma.

Let us know if that helps

fabien_arnaud
New Contributor II

I actually can't change the escape character because the double quote is the one being used by the source file and is required to correctly parse other columns in the dataframe such as the case below where the name column contains double quotes in the data value:

fabien_arnaud_0-1729516960100.png

As mentioned earlier though, the file can be read perfectly with Databricks runtime 15.4LTS so that will probably have to be the way forward. I hadn't upgraded yet because I had issues installing the various dependencies with the new Ubuntu version used by that runtime, but I did manage in the end.

I really appreciate the time you spent trying to help me out and your suggestions, Filip!

MilesMartinez
New Contributor II

Thank you so much for the solution.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group