Data shifted when a pyspark dataframe column only contains a comma

fabien_arnaud · 10-21-2024

I have a dataframe with several columns, one of which contains, for one specific record, just a comma and nothing else.

When I display the dataframe with the command

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838'))

The data is displayed correctly for all of my columns.

However, when I select specific columns from the same dataframe, for example

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"), col("state_cd"), col("state_nm"), col("country_cd"), col("country_nm")))

all of the data in the columns to the right of the one that contains only the comma is shifted to the left. The comma seems to be treated as a column separator during the "select", even though everything is loaded correctly in my dataframe.

How can I avoid this behavior?

I use Databricks Runtime 12.2 LTS and my notebook is in Python.
