Databricks

Chrispy · ‎04-08-2024

I have a very strange thing happening. I'm importing a csv file and nulls and blanks are being interpreted correctly. What is strange is that a column that regularly has a single space character value is having the single space converted to null.

I'm using this to import the file data:

df = spark.read.format("csv").options(mode='FAILFAST', multiLine=True, escape='"').csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})

I've tried including option nullValue=" " but then all blanks are being imported as blank.

I'm new to this and sure I'm doing something wrong.

Kaniz · ‎04-09-2024

Hi @Chrispy,

Handling Empty Cells as NULLs: When importing data from a CSV file, you want to treat empty cells as NULL values. This is a common requirement, especially when dealing with databases. Let’s explore a couple of approaches to achieve this:

Option 1: Retain Null Values from the Source: In your Spark code, you can enable the option to retain null values from the source. This ensures that empty cells are interpreted as NULLs in your DataFrame. Modify your code like this:

df = spark.read.format("csv").options(
    mode='FAILFAST',
    multiLine=True,
    escape='"',
    nullValue='',  # Retain null values from the source
).csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})

Option 2: Using Derived Columns: Another approach is to add a derived column that checks if a value is an empty string and replaces it with NULL. Here’s how you can do it:

from pyspark.sql.functions import col, when

df = spark.read.format("csv").options(
    mode='FAILFAST',
    multiLine=True,
    escape='"',
).csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})

# Add a derived column to handle empty cells
df = df.withColumn("new_column", when(col("your_column") == "", None).otherwise(col("your_column")))

Adjust the column names ("your_column") according to your actual dataset.

Additional Considerations:
- Make sure to replace "your_column" with the actual column name you want to handle.
- If you encounter any other issues or need further assistance, feel free to ask!

Remember that handling NULLs and empty cells correctly is crucial for data integrity and accurate analysis. Hopefully, one of these approaches will resolve your issue! 🚀¹ ².

Databricks

Spark Handling White Space as NULL

Databricks Community Social - May 2024

🔔 Attention Databricks Academy Users: SSO Implementation Incoming! Secure Your Account Today!

Announcing the General Availability of Databricks Asset Bundles

How to successfully build GenAI applications

Meet DBRX, the New Standard for High-Quality LLMs