Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud — Mon, 21 Oct 2024 08:58:18 GMT

I have a dataframe with several columns, one of which contains, for one specific record, just a comma and nothing else.

When displaying the dataframe with the command

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838'))

The data is displayed correctly for all of my columns.

However, when I select specific columns from the same dataframe, i.e.

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm")))

all of my data from columns to the right of the one that only contains the comma gets shifted to the left. The comma seems to be identified as a column separator during the "select" although everything is correctly loaded in my dataframe.

How can I avoid this behavior?

I use Databricks Runtime 12.2 LTS and my notebook is in Python.

Re: Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud — Mon, 21 Oct 2024 09:03:23 GMT

Here is a screenshot of my code and the output:

Re: Data shifted when a pyspark dataframe column only contains a comma

filipniziol — Mon, 21 Oct 2024 10:12:17 GMT

Hi @fabien_arnaud,

I have tried to reproduce the issue using DBR 12.2 and in my case everything works as expected:

Could you share how this dataframe is created? Are you perhaps reading it from a CSV file?
Also, could you create a new dataframe:

df_filtered = df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm"))

And then run:

df_filtered.printSchema()

df_filtered.show()

Let's check whether it is a problem with the dataframe itself or whether the display() function renders the dataframe incorrectly because of the standalone comma.

Re: Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud — Mon, 21 Oct 2024 11:25:05 GMT

Yes, the dataframe is read from a CSV file. Here is the code:

df_input = (spark
    .read
    .format('CSV')
    .options(header = True,
             delimiter = ",",
             quote = '"',
             escape = '"',
             inferSchema = 'false',
             encoding = 'UTF8',
             multiline = True,
             rootTag = '',
             rowTag = '',
             attributePrefix = ''
    )
    .load("dbfs:/mnt/bdwuploaddevfabien-mdm/mdm_vendor_master_2024-09-10.csv")
)

Here is the screenshot of a subsequent filtered dataframe as suggested. The problem persists:

By the way, I also tested the code with runtimes 13.3 LTS, 14.3 LTS and 15.4 LTS, and the issue occurs with all of them except 15.4 LTS.

Re: Data shifted when a pyspark dataframe column only contains a comma

filipniziol — Mon, 21 Oct 2024 12:05:11 GMT

Hi @fabien_arnaud ,

I think I know the issue.

Could you please change your escape character (escape = '"') so that it is different from your quote character (quote = '"')?
For example, set it to a backslash (\).

In your CSV there is a sequence like ","," and one of the quotes is being used to escape the comma.
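
As a rough illustration (this is plain Python, not Spark), here is a minimal sketch that parses a made-up sample line with Python's built-in csv module. It assumes the file follows the common convention where a quote inside a quoted field is doubled. Every value except the vendor code, as well as the vendor_nm column name, is hypothetical. It only shows what a correct parse of a field holding just "," should look like, so the result can be compared with what Spark returns:

```python
import csv
import io

# Hypothetical sample: a header plus one record whose state_cd field is only a comma
# and whose vendor_nm field contains doubled quotes (all values are made up).
SAMPLE = (
    'erp_vendor_cd,vendor_nm,postal_cd,state_cd,state_nm\r\n'
    'B6SA-VEN0008838,"ACME ""Best"" Ltd",1000,",",Brussels\r\n'
)


def parse_csv(text):
    """Parse CSV text where '"' quotes fields and, when doubled, escapes itself."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar='"',
        doublequote=True,  # a doubled quote inside a quoted field is a literal quote
    )
    return list(reader)


header, record = parse_csv(SAMPLE)
for name, value in zip(header, record):
    # repr() makes a field that contains only a comma easy to spot
    print(f"{name}: {value!r}")
```

If the raw line from your file splits into the expected number of fields here but not in Spark, the file itself is probably fine and the difference lies in how the reader handles the quote/escape combination.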

Let us know if that helps.

Re: Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud — Mon, 21 Oct 2024 13:28:39 GMT

I actually can't change the escape character: the double quote is what the source file uses, and it is required to correctly parse other columns in the dataframe, such as the case below, where the name column contains double quotes in the data value:

As mentioned earlier, though, the file is read perfectly with Databricks Runtime 15.4 LTS, so that will probably have to be the way forward. I hadn't upgraded yet because I had issues installing the various dependencies on the new Ubuntu version used by that runtime, but I did manage in the end.

I really appreciate the time you spent trying to help me out and your suggestions, Filip!

Re: Data shifted when a pyspark dataframe column only contains a comma

MilesMartinez — Fri, 08 Nov 2024 13:21:30 GMT

Thank you so much for the solution.