<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data shifted when a pyspark dataframe column only contains a comma in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95148#M39781</link>
    <description>&lt;P&gt;I have a dataframe containing several columns among which 1 contains, for one specific record, just a comma, nothing else.&lt;/P&gt;&lt;P&gt;When displaying the dataframe with the command&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_input.&lt;/SPAN&gt;&lt;SPAN&gt;where&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"erp_vendor_cd"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;==&lt;/SPAN&gt; &lt;SPAN&gt;'B6SA-VEN0008838'&lt;/SPAN&gt;&lt;SPAN&gt;))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;The data is displayed correctly for all of my columns&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;However, when I select specific columns from the same dataframe, i.e.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_input.&lt;/SPAN&gt;&lt;SPAN&gt;where&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"erp_vendor_cd"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;==&lt;/SPAN&gt; &lt;SPAN&gt;'B6SA-VEN0008838'&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;select&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"postal_cd"&lt;/SPAN&gt;&lt;SPAN&gt;),&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"state_cd"&lt;/SPAN&gt;&lt;SPAN&gt;), &lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"state_nm"&lt;/SPAN&gt;&lt;SPAN&gt;),&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"country_cd"&lt;/SPAN&gt;&lt;SPAN&gt;), 
&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"country_nm"&lt;/SPAN&gt;&lt;SPAN&gt;)))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;all of my data from columns to the right of the one that only contains the comma gets shifted to the left. The comma seems to be identified as a column separator during the "select" although everything is correctly loaded in my dataframe.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;How can I avoid this behavior?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I use databricks runtime 12.2LTS and my notebook is in python.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Mon, 21 Oct 2024 08:58:18 GMT</pubDate>
    <dc:creator>fabien_arnaud</dc:creator>
    <dc:date>2024-10-21T08:58:18Z</dc:date>
    <item>
      <title>Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95148#M39781</link>
      <description>&lt;P&gt;I have a dataframe containing several columns among which 1 contains, for one specific record, just a comma, nothing else.&lt;/P&gt;&lt;P&gt;When displaying the dataframe with the command&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_input.&lt;/SPAN&gt;&lt;SPAN&gt;where&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"erp_vendor_cd"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;==&lt;/SPAN&gt; &lt;SPAN&gt;'B6SA-VEN0008838'&lt;/SPAN&gt;&lt;SPAN&gt;))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;The data is displayed correctly for all of my columns&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;However, when I select specific columns from the same dataframe, i.e.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_input.&lt;/SPAN&gt;&lt;SPAN&gt;where&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"erp_vendor_cd"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;==&lt;/SPAN&gt; &lt;SPAN&gt;'B6SA-VEN0008838'&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;select&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"postal_cd"&lt;/SPAN&gt;&lt;SPAN&gt;),&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"state_cd"&lt;/SPAN&gt;&lt;SPAN&gt;), &lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"state_nm"&lt;/SPAN&gt;&lt;SPAN&gt;),&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"country_cd"&lt;/SPAN&gt;&lt;SPAN&gt;), 
&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"country_nm"&lt;/SPAN&gt;&lt;SPAN&gt;)))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;all of my data from columns to the right of the one that only contains the comma gets shifted to the left. The comma seems to be identified as a column separator during the "select" although everything is correctly loaded in my dataframe.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;How can I avoid this behavior?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I use databricks runtime 12.2LTS and my notebook is in python.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 21 Oct 2024 08:58:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95148#M39781</guid>
      <dc:creator>fabien_arnaud</dc:creator>
      <dc:date>2024-10-21T08:58:18Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95150#M39782</link>
      <description>&lt;P&gt;Here is a screenshot of my code and the output:&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fabien_arnaud_0-1729501314813.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12102i09344810A762DB98/image-size/medium?v=v2&amp;amp;px=400" role="button" title="fabien_arnaud_0-1729501314813.png" alt="fabien_arnaud_0-1729501314813.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2024 09:03:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95150#M39782</guid>
      <dc:creator>fabien_arnaud</dc:creator>
      <dc:date>2024-10-21T09:03:23Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95189#M39783</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/128287"&gt;@fabien_arnaud&lt;/a&gt;,&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I have tried to reproduce the issue using DBR 12.2, and in my case everything works as expected:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1729505239655.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12107iE0255EC69C79444C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1729505239655.png" alt="filipniziol_0-1729505239655.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Could you share how this dataframe is created? Are you reading it from a CSV file?&lt;BR /&gt;Also, could you assign the filtered result to a new dataframe:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_filtered = df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838').select(col("postal_cd"),col("state_cd"), col("state_nm"),col("country_cd"), col("country_nm"))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;And then run:&lt;BR /&gt;&lt;BR /&gt;df_filtered.printSchema()&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;df_filtered.show()&lt;BR /&gt;&lt;BR /&gt;Let's check whether the problem is in the dataframe itself or whether the display() function renders the dataframe incorrectly because of the standalone comma.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2024 10:12:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95189#M39783</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-21T10:12:17Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95223#M39784</link>
      <description>&lt;P&gt;Yes the dataframe reads from a CSV. Here is the code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df_input &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; (spark&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .read&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;format&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;'CSV'&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;options&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;header&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;delimiter&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;","&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;quote&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;'"'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;escape&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;'"'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; 
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;inferSchema&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;'false'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;encoding&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;'UTF8'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;multiline&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;rootTag&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;''&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;rowTag&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;''&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;attributePrefix&lt;/SPAN&gt; &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;''&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;load&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"dbfs:/mnt/bdwuploaddevfabien-mdm/mdm_vendor_master_2024-09-10.csv"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Here is the screenshot of a subsequent filtered dataframe as suggested. The problem persists:&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fabien_arnaud_0-1729509830683.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12114i36D16DE85F7BE9DB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="fabien_arnaud_0-1729509830683.png" alt="fabien_arnaud_0-1729509830683.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;By the way, I tested the code with runtimes 13.3LTS, 14.3LTS and 15.4LTS as well, and the issue occurs with all except 15.4LTS.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 21 Oct 2024 11:25:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95223#M39784</guid>
      <dc:creator>fabien_arnaud</dc:creator>
      <dc:date>2024-10-21T11:25:05Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95236#M39785</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/128287"&gt;@fabien_arnaud&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;I think I know what the issue is.&lt;BR /&gt;&lt;BR /&gt;Could you please change your escape character (&lt;SPAN&gt;escape&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;'"') to something different from your quote character (quote&amp;nbsp;=&amp;nbsp;'"')?&lt;BR /&gt;For example, set it to a backslash (\).&lt;BR /&gt;&lt;BR /&gt;Your CSV contains a sequence like ","," and, because the escape character is the same as the quote character, one of those quotes is probably being read as an escape character rather than as a field quote, so the comma is handled incorrectly.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Let us know if that helps.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2024 12:05:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95236#M39785</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-21T12:05:11Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95275#M39786</link>
      <description>&lt;P&gt;I can't actually change the escape character: the source file uses the double quote as its escape character, and it is required to parse other columns in the dataframe correctly. One example is the case below, where the name column contains double quotes in the data value:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fabien_arnaud_0-1729516960100.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12119i5F96D5C8F267717A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="fabien_arnaud_0-1729516960100.png" alt="fabien_arnaud_0-1729516960100.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As mentioned earlier, though, the file is read correctly with Databricks Runtime 15.4 LTS, so upgrading will probably have to be the way forward. I hadn't upgraded yet because I had issues installing the various dependencies on the new Ubuntu version used by that runtime, but I did manage in the end.&lt;/P&gt;&lt;P&gt;I really appreciate the time you spent trying to help me out and your suggestions, Filip!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2024 13:28:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/95275#M39786</guid>
      <dc:creator>fabien_arnaud</dc:creator>
      <dc:date>2024-10-21T13:28:39Z</dc:date>
    </item>
    <item>
      <title>Re: Data shifted when a pyspark dataframe column only contains a comma</title>
      <link>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/98189#M39787</link>
      <description>&lt;P&gt;Thank you so much for the solution.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 13:21:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-shifted-when-a-pyspark-dataframe-column-only-contains-a/m-p/98189#M39787</guid>
      <dc:creator>MilesMartinez</dc:creator>
      <dc:date>2024-11-08T13:21:30Z</dc:date>
    </item>
  </channel>
</rss>

