cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Invisible empty spaces when reading .csv files

BAZA
New Contributor III

When importing a .csv file with leading and/or trailing empty spaces around the separators, the output results in strings that appear to be trimmed on the output table or when using .display() but are not actually trimmed.
It is possible to identify that the values are not trimmed because the where statement only works when the spaces are included.

The hotfix is to use trim() to make sure that the imported data does not have leading or trailing spaces.
I'll like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), then these spaces are visible in the output table.

Attached is a minimal viable example notebook (CSV_leading_Space_Bug Notebook) that imports the sample_data.csv file. Since I can't submit arbitrary attachments I will write the content of the notebook and sample_data.csv below so that a simple copy and paste recreates the failes. It is important that the empty spaces of sample_data.csv remain as they are.

Best regards,
Bruno Antรณnio

sample_data.csv
Name, Sport, City, Score
Anna, Soccer,Paris , 123
Bruno, Tenis,Rome , 75
Catherina, Volleyball,Oslo , 66
Diego, Surf , Barcelona,81

 

CSV_leading_Space_Bug Notebook

# Databricks notebook source
# MAGIC %md
# MAGIC ## Example of importing a .csv file with leading and trailing empty spaces
# MAGIC
# MAGIC Importing a .csv file with spaces around the separators results in **invisible leading and trailing empty spaces** that are difficult to debug.
# MAGIC The hotfix is to use the `trim()` function but a permanent fix is requested.

# COMMAND ----------

from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType, DecimalType, DateType, ByteType, BooleanType

# COMMAND ----------

#read csv file

sampleDataFilePath = "<path-to-file>/sample_data.csv"

schema = StructType([StructField("Name", StringType(), True),
StructField("Sport", StringType(), True),
StructField("City", StringType(), True),
StructField("Score", IntegerType(), True)

])

df = (spark.read.format("csv")
.schema(schema)
.options(header=True, enforceSchema=True, inferSchema=False, sep=",")
.load(sampleDataFilePath)
)

df.createOrReplaceTempView("sample_data")

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data

# COMMAND ----------

# MAGIC %python
# MAGIC spark.sql("""
# MAGIC select *
# MAGIC from sample_data
# MAGIC """).display()

# COMMAND ----------

# MAGIC %md
# MAGIC It seems that the leading and trailing spaces in the string columns were trimmed. But this is not the case:

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where city = "Paris"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where city = "Paris "

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = "Surf"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = "Surf "

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = " Surf"

# COMMAND ----------

# MAGIC %sql
# MAGIC select *
# MAGIC from sample_data
# MAGIC where Sport = " Surf "

# COMMAND ----------

# MAGIC %md
# MAGIC ## Proposal
# MAGIC
# MAGIC The hotfix is to use `trim()` to make sure that the imported data does not have leading or trailing spaces.
# MAGIC
# MAGIC I'll like to propose that if there are leading or trailing spaces in the input .csv file (erroneously or not), then these spaces are visible in the output table.

13 REPLIES 13

-werners-
Esteemed Contributor III

hm are you sure the spaces are not visible?  Because using display() is my way to go to detect leading/trailing spaces.

BAZA
New Contributor III

Usually displaying the columns is enough to identify the spaces. I often do that to make sure if I need a trim() on join operations because some tables that I work with have trailing spaces. But in this odd case they are not visible. Even copying the data does show the spaces.

-werners-
Esteemed Contributor III

perhaps these are invisible characters and not plain spaces.

BAZA
New Contributor III

I created the .csv by hand and wrote the spaces using the space bar. ๐Ÿคท๐Ÿปโ€โ™€๏ธ

-werners-
Esteemed Contributor III

I see.
If you actually need the spaces (so trimming is not an option), you could try to detect the spaces using regex.

BAZA
New Contributor III

If I use a substr to select that character it returns an empty, but not null string. I cannot manually select the value of the cell. By copy and pasting the output into excel, I can select the space and an online decoder indicates that is a \x20 character.

If I concat a bunch of the same substr the value returned has spaces that I can select. By copy and pasting in a online decoder I manage to identify the characters \x0A\x20. An line feed followed by space(s). It always starts with \x0A and then X times \x20 being X the number of substr that I concated minus 1. The first substr is unselectable.

I think that these spaces should always be visible and selectable.

-werners-
Esteemed Contributor III

agreed

Raluka
New Contributor III

Thank you so much for helping me.

Raluka
New Contributor III

I discovered an in-depth article that went beyond the physical aspects of aging and testosterone. It examined the emotional https://misterolympia.shop/buy/injectable-steroids/testosterone/testosterone-cypionate/ and psychological aspects of growing older while emphasizing the role of hormone optimization in maintaining a sense of vitality, purpose, and fulfillment in the later stages of life.

sallytomato
New Contributor III

Iโ€™ve been working on a new marketing campaign for my business, and Iโ€™m really struggling to maintain consistency in branding across all print materials. It seems like every time I print something new, the colors, fonts, or even the layout look a bit off compared to what I have on my website or social media. How do you ensure that your print materials always match your digital branding exactly? Any tips or tools you use to help with this?

sallytomato
New Contributor III

Ensuring consistency in branding across all print materials is crucial, but itโ€™s not always easy to achieve, especially when dealing with different media. One key factor is to establish a well-defined brand guide that includes specifics on your colors, fonts, and logo usage. Having these standards written down will make it easier to reference when designing anything, whether it's print or digital.

Additionally, working with a professional printing service can help a lot in maintaining brand consistency. For instance, when you use a platform like GoPrint https://goprint.ae/ , they can provide you with color matching services to ensure the colors on your printed materials are exactly as they appear on screen. They also have high-quality printers that ensure clarity, sharpness, and accurate reproduction of your brandโ€™s design. Itโ€™s important to use the same resolution for your designs, so no details are lost in translation from digital to print. They even offer proofs so you can double-check your designs before going to full print, helping avoid any mismatches.

sallytomato
New Contributor III

Iโ€™ve found that investing in high-quality print services like GoPrint really makes a difference in ensuring your materials match perfectly. Also, it's good practice to always test with smaller prints first, like business cards or brochures, before going large-scale with posters or banners. That way, you can make adjustments without the risk of wasting too much. Always check the proofs for any discrepancies, especially with colors, before finalizing!

AngelEdwards
New Contributor II

Online kasรญna sa stรกvajรบ ฤoraz pohodlnejลกรญmi pre pouลพรญvateฤพov vฤaka zavรกdzaniu modernรฝch technolรณgiรญ. Naprรญklad jedna z dรดleลพitรฝch funkciรญ je import sรบborov. Mnohรฉ platformy umoลพลˆujรบ nahrรกvaลฅ histรณriu transakciรญ, sprรกvy o vรฝhrach a dokonca aj dokumenty na overenie identity. To vรฝrazne zjednoduลกuje proces registrรกcie a potvrdenia รบdajov, ako aj pomรกha hrรกฤom sledovaลฅ ich hernรบ aktivitu. Ak potrebujete overiลฅ svoju identitu, staฤรญ nahraลฅ pas alebo inรฉ dokumenty cez pohodlnรฉ rozhranie. Takรฉto inovรกcie robia pouลพรญvanie online kasรญn bezpeฤnรฝm a pohodlnรฝm pre vลกetkรฝch รบฤastnรญkov. Ak chcete zistiลฅ viac o modernรฝch funkciรกch a moลพnostiach, ktorรฉ ponรบkajรบ najlepลกie platformy pre hazardnรฉ hry, urฤite navลกtรญvte zahranicneonlinecasino.org . Nรกjdete tam podrobnรฉ recenzie, rady pri vรฝbere bezpeฤnรฝch a spoฤพahlivรฝch kasรญn, ako aj informรกcie o novรฝch akciรกch a bonusoch.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group