Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud
Visitor

I have a dataframe with several columns, one of which contains, for one specific record, just a comma and nothing else.

When displaying the dataframe with the command

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838'))
 
the data is displayed correctly in all columns.
 
However, when I select specific columns from the same dataframe, i.e.
 
display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm")))
 
all the data in columns to the right of the one containing only the comma is shifted one position to the left. The comma seems to be treated as a column separator during the select, even though everything is loaded correctly in the dataframe.
How can I avoid this behavior?
 
I use Databricks Runtime 12.2 LTS and my notebook is in Python.
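For reference, a field containing only a comma is perfectly valid CSV as long as it is quoted. A quick sanity check with Python's standard csv module (this illustrates RFC 4180 parsing only, not Spark's parser) shows such a field coming back intact as a single value:

```python
import csv
import io

# A row whose middle field is just a comma, quoted per RFC 4180.
line = 'postal,",",state\n'

rows = list(csv.reader(io.StringIO(line)))
print(rows[0])  # → ['postal', ',', 'state']
```

So the file content itself is unambiguous; the question is how Spark's CSV options interpret it.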
5 REPLIES

fabien_arnaud
Visitor

Here is a screenshot of my code and the output: 

[Screenshot: fabien_arnaud_0-1729501314813.png]

 

filipniziol
Contributor

Hi @fabien_arnaud,

I have tried to reproduce the issue using DBR 12.2 and in my case everything works as expected:

[Screenshot: filipniziol_0-1729505239655.png]

Could you share how this dataframe is created? Are you maybe reading a CSV file?
Also, could you create a new dataframe:

 

df_filtered = df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm"))

 

And then run:

df_filtered.printSchema()

df_filtered.show()

Let's check whether the problem is in the dataframe itself or whether the display() function renders it incorrectly because of the standalone comma.

 

 

fabien_arnaud
Visitor

Yes the dataframe reads from a CSV. Here is the code:

 

df_input = (spark
            .read
            .format('CSV')
            .options(header=True,
                     delimiter=",",
                     quote='"',
                     escape='"',  # escape is the same character as quote
                     inferSchema='false',
                     encoding='UTF8',
                     multiline=True,
                     # rootTag, rowTag, and attributePrefix are XML-source
                     # options; the CSV reader ignores them.
                     rootTag='',
                     rowTag='',
                     attributePrefix=''
                     )
            .load("dbfs:/mnt/bdwuploaddevfabien-mdm/mdm_vendor_master_2024-09-10.csv")
            )
 
Here is the screenshot of a subsequent filtered dataframe as suggested. The problem persists:  
 
[Screenshot: fabien_arnaud_0-1729509830683.png]

 

 
By the way, I tested the code with runtimes 13.3 LTS, 14.3 LTS, and 15.4 LTS as well, and the issue occurs with all of them except 15.4 LTS.
 
 

filipniziol
Contributor

Hi @fabien_arnaud ,

I think I know the issue.

Could you please change your escape character (escape = '"') so that it differs from your quote character (quote = '"')?
For example, set it to \.

In your CSV there is a sequence like ","," and one of the quotes is being interpreted as an escape for the comma.
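The difference between the two conventions can be sketched with Python's standard csv module (just an illustration of the two escaping styles, not of Spark's parser): when a distinct escape character is used, an embedded quote is written as \" instead of being doubled, so a quoted bare-comma field no longer sits next to an ambiguous run of quotes:

```python
import csv
import io

# A hypothetical row: a key, a field that is just a comma, and a field
# with embedded quotes (names are made up for illustration).
row = ['B6SA-VEN0008838', ',', 'ACME "North" Inc.']

# Convention 1: quote and escape are the same character (doubled quotes).
buf1 = io.StringIO()
csv.writer(buf1, quoting=csv.QUOTE_ALL).writerow(row)
print(buf1.getvalue())  # "B6SA-VEN0008838",",","ACME ""North"" Inc."

# Convention 2: a distinct escape character (escape = '\').
buf2 = io.StringIO()
csv.writer(buf2, quoting=csv.QUOTE_ALL,
           doublequote=False, escapechar='\\').writerow(row)
print(buf2.getvalue())  # "B6SA-VEN0008838",",","ACME \"North\" Inc."
```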

Let us know if that helps.

fabien_arnaud
Visitor

I actually can't change the escape character: the double quote is the one used by the source file, and it is required to correctly parse other columns in the dataframe, such as the case below where the name column contains double quotes in the data value:

[Screenshot: fabien_arnaud_0-1729516960100.png]
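For what it's worth, doubled-quote escaping and a bare-comma field can coexist in the same row under standard RFC 4180 parsing. Python's stdlib csv module in its default doublequote mode (which corresponds to quote = escape = '"') reads such a line back cleanly; this only shows the file itself is well-formed, not how Spark's parser behaves on 12.2:

```python
import csv
import io

# One field with embedded (doubled) quotes and one field that is just
# a comma, on the same line (values are made up for illustration).
line = '"ACME ""North"" Inc.",",","75001"\n'

rows = list(csv.reader(io.StringIO(line)))  # doublequote=True is the default
print(rows[0])  # → ['ACME "North" Inc.', ',', '75001']
```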

As mentioned earlier, though, the file is read perfectly with Databricks Runtime 15.4 LTS, so that will probably have to be the way forward. I hadn't upgraded yet because I had issues installing the various dependencies on the new Ubuntu version used by that runtime, but I did manage in the end.

I really appreciate the time you spent trying to help me out and your suggestions, Filip!
