<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Invisible empty spaces when reading .csv files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/46441#M28066</link>
    <description>&lt;P&gt;Thank you so much for helping me.&lt;/P&gt;</description>
    <pubDate>Wed, 27 Sep 2023 23:31:17 GMT</pubDate>
    <dc:creator>Raluka</dc:creator>
    <dc:date>2023-09-27T23:31:17Z</dc:date>
    <item>
      <title>Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35664#M25938</link>
      <description>&lt;P&gt;When importing a .csv file with leading and/or trailing empty spaces around the separators, the output results in strings that appear to be trimmed on the output table or when using .display() but are not actually trimmed.&lt;BR /&gt;It is possible to identify that the values are not trimmed because the where statement only works when the spaces are included.&lt;/P&gt;&lt;P&gt;The hotfix is to use trim() to make sure that the imported data does not have leading or trailing spaces.&lt;BR /&gt;I'll like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), then these spaces are visible in the output table.&lt;/P&gt;&lt;P&gt;Attached is a minimal viable example notebook (CSV_leading_Space_Bug Notebook) that imports the sample_data.csv file. Since I can't submit arbitrary attachments I will write the content of the notebook and sample_data.csv below so that a simple copy and paste recreates the failes. It is important that the empty spaces of&amp;nbsp;sample_data.csv remain as they are.&lt;/P&gt;&lt;P&gt;Best regards,&lt;BR /&gt;Bruno António&lt;BR /&gt;&lt;BR /&gt;sample_data.csv&lt;BR /&gt;Name, Sport, City, Score&lt;BR /&gt;Anna, Soccer,Paris , 123&lt;BR /&gt;Bruno, Tenis,Rome , 75&lt;BR /&gt;Catherina, Volleyball,Oslo , 66&lt;BR /&gt;Diego, Surf , Barcelona,81&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;CSV_leading_Space_Bug Notebook&lt;/P&gt;&lt;P&gt;# Databricks notebook source&lt;BR /&gt;# MAGIC %md&lt;BR /&gt;# MAGIC ## Example of importing a .csv file with leading and trailing empty spaces&lt;BR /&gt;# MAGIC&lt;BR /&gt;# MAGIC Importing a .csv file with spaces around the separators results in **invisible leading and trailing empty spaces** that are difficult to debug.&lt;BR /&gt;# MAGIC The hotfix is to use the `trim()` function but a permanent fix is requested.&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType, DecimalType, DateType, ByteType, BooleanType&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;#read csv file&lt;/P&gt;&lt;P&gt;sampleDataFilePath = "&amp;lt;path-to-file&amp;gt;/sample_data.csv"&lt;/P&gt;&lt;P&gt;schema = StructType([StructField("Name", StringType(), True),&lt;BR /&gt;StructField("Sport", StringType(), True),&lt;BR /&gt;StructField("City", StringType(), True),&lt;BR /&gt;StructField("Score", IntegerType(), True)&lt;/P&gt;&lt;P&gt;])&lt;/P&gt;&lt;P&gt;df = (spark.read.format("csv")&lt;BR /&gt;.schema(schema)&lt;BR /&gt;.options(header=True, enforceSchema=True, inferSchema=False, sep=",")&lt;BR /&gt;.load(sampleDataFilePath)&lt;BR /&gt;)&lt;/P&gt;&lt;P&gt;df.createOrReplaceTempView("sample_data")&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %python&lt;BR /&gt;# MAGIC spark.sql("""&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC """).display()&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %md&lt;BR /&gt;# MAGIC It seems that the leading and trailing spaces in the string columns were trimmed. But this is not the case:&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where city = "Paris"&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where city = "Paris "&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where Sport = "Surf"&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where Sport = "Surf "&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where Sport = " Surf"&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %sql&lt;BR /&gt;# MAGIC select *&lt;BR /&gt;# MAGIC from sample_data&lt;BR /&gt;# MAGIC where Sport = " Surf "&lt;/P&gt;&lt;P&gt;# COMMAND ----------&lt;/P&gt;&lt;P&gt;# MAGIC %md&lt;BR /&gt;# MAGIC ## Proposal&lt;BR /&gt;# MAGIC&lt;BR /&gt;# MAGIC The hotfix is to use `trim()` to make sure that the imported data does not have leading or trailing spaces.&lt;BR /&gt;# MAGIC&lt;BR /&gt;# MAGIC I'll like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), then these spaces are visible in the output table.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 10:37:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35664#M25938</guid>
      <dc:creator>BAZA</dc:creator>
      <dc:date>2023-06-28T10:37:00Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35665#M25939</link>
      <description>&lt;P&gt;hm are you sure the spaces are not visible?&amp;nbsp; Because using display() is my way to go to detect leading/trailing spaces.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 10:51:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35665#M25939</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-28T10:51:20Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35667#M25940</link>
      <description>&lt;P&gt;Usually displaying the columns is enough to identify the spaces. I often do that to make sure if I need a trim() on join operations because some tables that I work with have trailing spaces. But in this odd case they are not visible. Even copying the data does show the spaces.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 11:11:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35667#M25940</guid>
      <dc:creator>BAZA</dc:creator>
      <dc:date>2023-06-28T11:11:02Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35669#M25941</link>
      <description>&lt;P&gt;perhaps these are invisible characters and not plain spaces.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 12:06:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35669#M25941</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-28T12:06:56Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35675#M25942</link>
      <description>&lt;P&gt;I created the .csv by hand and wrote the spaces using the space bar. 🤷🏻‍&lt;span class="lia-unicode-emoji" title=":female_sign:"&gt;♀️&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 13:42:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35675#M25942</guid>
      <dc:creator>BAZA</dc:creator>
      <dc:date>2023-06-28T13:42:30Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35676#M25943</link>
      <description>&lt;P&gt;I see.&lt;BR /&gt;If you actually need the spaces (so trimming is not an option), you could try to detect the spaces using regex.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 13:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35676#M25943</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-28T13:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35678#M25944</link>
      <description>&lt;P&gt;If I use a substr to select that character it returns an empty, but not null string. I cannot manually select the value of the cell. By copy and pasting the output into excel, I can select the space and an online decoder indicates that is a \x20 character.&lt;BR /&gt;&lt;BR /&gt;If I concat a bunch of the same substr the value returned has spaces that I can select. By copy and pasting in a online decoder I manage to identify the characters&amp;nbsp;\x0A\x20. An line feed followed by space(s). It always starts with&amp;nbsp;\x0A and then X times&amp;nbsp;\x20 being X the number of substr that I concated minus 1. The first substr is unselectable.&lt;BR /&gt;&lt;BR /&gt;I think that these spaces should always be visible and selectable.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 14:09:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35678#M25944</guid>
      <dc:creator>BAZA</dc:creator>
      <dc:date>2023-06-28T14:09:28Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35679#M25945</link>
      <description>&lt;P&gt;agreed&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jun 2023 14:11:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/35679#M25945</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-28T14:11:04Z</dc:date>
    </item>
    <item>
      <title>Re: Invisible empty spaces when reading .csv files</title>
      <link>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/46441#M28066</link>
      <description>&lt;P&gt;Thank you so much for helping me.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2023 23:31:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/invisible-empty-spaces-when-reading-csv-files/m-p/46441#M28066</guid>
      <dc:creator>Raluka</dc:creator>
      <dc:date>2023-09-27T23:31:17Z</dc:date>
    </item>
  </channel>
</rss>

