Invisible empty spaces when reading .csv files

BAZA — Wed, 28 Jun 2023 10:37:00 GMT

When importing a .csv file that has leading and/or trailing spaces around the separators, the resulting string values appear trimmed in the output table and in .display(), but they are not actually trimmed.
You can tell that the values are not trimmed because a where clause only matches when the spaces are included in the literal.

The hotfix is to use trim() to make sure that the imported data does not have leading or trailing spaces.
I'd like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), these spaces be visible in the output table.
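As an illustration of the trim() hotfix, here is a minimal sketch that builds a Spark SQL query wrapping every string column in trim(). The helper name `build_trimmed_select` and the view name `sample_data_trimmed` are hypothetical; the table and column names come from the example below. The helper only builds a string, so it runs without a cluster.

```python
def build_trimmed_select(table, string_columns, other_columns=()):
    """Return a SELECT statement that applies trim() to each string column."""
    trimmed = [f"trim(`{col}`) AS `{col}`" for col in string_columns]
    untouched = [f"`{col}`" for col in other_columns]
    return f"SELECT {', '.join(trimmed + untouched)} FROM {table}"


query = build_trimmed_select("sample_data", ["Name", "Sport", "City"], ["Score"])
print(query)

# In a Databricks notebook (where `spark` is predefined) the query could be used as:
# spark.sql(query).createOrReplaceTempView("sample_data_trimmed")
```

Note that this removes the spaces rather than making them visible, so it is only suitable when the whitespace carries no meaning.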

Attached is a minimal working example notebook (CSV_leading_Space_Bug Notebook) that imports the sample_data.csv file. Since I can't submit arbitrary attachments, I will write the content of the notebook and sample_data.csv below so that a simple copy and paste recreates the failure. It is important that the empty spaces of sample_data.csv remain as they are.

Best regards,
Bruno António

sample_data.csv
Name, Sport, City, Score
Anna, Soccer,Paris , 123
Bruno, Tenis,Rome , 75
Catherina, Volleyball,Oslo , 66
Diego, Surf , Barcelona,81

CSV_leading_Space_Bug Notebook

# Databricks notebook source
# MAGIC %md
# MAGIC ## Example of importing a .csv file with leading and trailing empty spaces
# MAGIC
# MAGIC Importing a .csv file with spaces around the separators results in **invisible leading and trailing empty spaces** that are difficult to debug.
# MAGIC The hotfix is to use the `trim()` function but a permanent fix is requested.

# COMMAND ----------

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# COMMAND ----------

# Read the CSV file with an explicit schema

sampleDataFilePath = "<path-to-file>/sample_data.csv"

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Sport", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Score", IntegerType(), True),
])

df = (
    spark.read.format("csv")
    .schema(schema)
    .options(header=True, enforceSchema=True, inferSchema=False, sep=",")
    .load(sampleDataFilePath)
)

df.createOrReplaceTempView("sample_data")

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data

# COMMAND ----------

# MAGIC %python
# MAGIC spark.sql("""
# MAGIC select *
# MAGIC from sample_data
# MAGIC """).display()

# COMMAND ----------

# MAGIC %md
# MAGIC It seems that the leading and trailing spaces in the string columns were trimmed. But this is not the case:

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where city = "Paris"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where city = "Paris "

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = "Surf"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = "Surf "

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = " Surf"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = " Surf "

# COMMAND ----------

# MAGIC %md
# MAGIC ## Proposal
# MAGIC
# MAGIC The hotfix is to use `trim()` to make sure that the imported data does not have leading or trailing spaces.
# MAGIC
# MAGIC I'd like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), these spaces be visible in the output table.
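
Besides trim(), Spark's CSV reader has the `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options, which default to false when reading. The sketch below adds them to the options used in the notebook above. It assumes that dropping the whitespace at read time is acceptable, which is not the case if the spaces are meaningful. The dictionary name `csv_options` is hypothetical.

```python
# Options passed to the CSV reader; the first four match the notebook above.
csv_options = {
    "header": True,
    "enforceSchema": True,
    "inferSchema": False,
    "sep": ",",
    # Assumption: dropping whitespace around values at read time is acceptable.
    "ignoreLeadingWhiteSpace": True,
    "ignoreTrailingWhiteSpace": True,
}

# In the notebook (with `spark`, `schema` and `sampleDataFilePath` defined as above):
# df = (spark.read.format("csv")
#       .schema(schema)
#       .options(**csv_options)
#       .load(sampleDataFilePath))
```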

Re: Invisible empty spaces when reading .csv files

-werners- — Wed, 28 Jun 2023 10:51:20 GMT

Hm, are you sure the spaces are not visible? Using display() is my go-to way to detect leading/trailing spaces.

Re: Invisible empty spaces when reading .csv files

BAZA — Wed, 28 Jun 2023 11:11:02 GMT

Usually displaying the columns is enough to identify the spaces. I often do that to check whether I need a trim() on join operations, because some tables that I work with have trailing spaces. But in this odd case they are not visible. Even copying the data does not show the spaces.

Re: Invisible empty spaces when reading .csv files

-werners- — Wed, 28 Jun 2023 12:06:56 GMT

Perhaps these are invisible characters and not plain spaces.

Re: Invisible empty spaces when reading .csv files

BAZA — Wed, 28 Jun 2023 13:42:30 GMT

I created the .csv by hand and wrote the spaces using the space bar. 🤷🏻‍♀️

Re: Invisible empty spaces when reading .csv files

-werners- — Wed, 28 Jun 2023 13:46:16 GMT

I see.
If you actually need the spaces (so trimming is not an option), you could try to detect the spaces using regex.
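
A rough sketch of the regex idea in plain Python, assuming the values have already been pulled out of the table (for example with collect()). The names `EDGE_WHITESPACE` and `has_edge_whitespace` are hypothetical. A similar pattern could presumably be used with Spark SQL's rlike operator, though the escaping rules for string literals differ there.

```python
import re

# Matches one or more whitespace characters at the start or the end of a value.
EDGE_WHITESPACE = re.compile(r"^\s+|\s+$")


def has_edge_whitespace(value):
    """Return True if the value starts or ends with whitespace."""
    return bool(EDGE_WHITESPACE.search(value))


# Values as they appear between the separators in sample_data.csv.
for value in ["Anna", " Soccer", "Paris ", " Surf "]:
    print(repr(value), has_edge_whitespace(value))
```

Printing with repr() puts quotes around each value, so the spaces show up even when the plain output hides them.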

Re: Invisible empty spaces when reading .csv files

BAZA — Wed, 28 Jun 2023 14:09:28 GMT

If I use substr to select that character, it returns an empty but non-null string. I cannot manually select the value in the cell. After copying and pasting the output into Excel, I can select the space, and an online decoder indicates that it is a \x20 character.

If I concat several copies of the same substr, the returned value has spaces that I can select. Pasting it into an online decoder, I identified the characters \x0A\x20: a line feed followed by space(s). It always starts with \x0A, followed by X occurrences of \x20, where X is the number of substr calls that I concatenated minus 1. The first substr is unselectable.
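
Instead of an online decoder, the characters can also be inspected locally. This is a minimal sketch in plain Python, assuming the suspicious value has been copied into a string; the function name `show_code_points` is hypothetical, and the two-digit format assumes ASCII-range characters.

```python
def show_code_points(value):
    """Return each character of the value as a \\xNN escape (ASCII range assumed)."""
    return "".join(f"\\x{ord(ch):02X}" for ch in value)


print(show_code_points(" Surf "))  # \x20\x53\x75\x72\x66\x20
print(show_code_points("\n  "))    # \x0A\x20\x20
```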

I think that these spaces should always be visible and selectable.

Re: Invisible empty spaces when reading .csv files

-werners- — Wed, 28 Jun 2023 14:11:04 GMT

agreed

Re: Invisible empty spaces when reading .csv files

Raluka — Wed, 27 Sep 2023 23:31:17 GMT

Thank you so much for helping me.