cancel
Showing results for 
Search instead for 
Did you mean: 
Community Discussions
cancel
Showing results for 
Search instead for 
Did you mean: 

Spark Handling White Space as NULL

Chrispy
New Contributor

I have a very strange thing happening.  I'm importing a csv file and nulls and blanks are being interpreted correctly.  What is strange is that a column that regularly has a single space character value is having the single space converted to null.

I'm using this to import the file data:

df = spark.read.format("csv").options(mode='FAILFAST', multiLine=True, escape='"').csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})

I've tried including option nullValue=" " but then all blanks are being imported as blank.

I'm new to this and sure I'm doing something wrong.

1 REPLY 1

Kaniz
Community Manager
Community Manager

Hi @Chrispy

  1. Handling Empty Cells as NULLs: When importing data from a CSV file, you want to treat empty cells as NULL values. This is a common requirement, especially when dealing with databases. Let’s explore a couple of approaches to achieve this:

    • Option 1: Retain Null Values from the Source: In your Spark code, you can enable the option to retain null values from the source. This ensures that empty cells are interpreted as NULLs in your DataFrame. Modify your code like this:

      df = spark.read.format("csv").options(
          mode='FAILFAST',
          multiLine=True,
          escape='"',
          nullValue='',  # Retain null values from the source
      ).csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})
      
    • Option 2: Using Derived Columns: Another approach is to add a derived column that checks if a value is an empty string and replaces it with NULL. Here’s how you can do it:

      from pyspark.sql.functions import col, when
      
      df = spark.read.format("csv").options(
          mode='FAILFAST',
          multiLine=True,
          escape='"',
      ).csv(path=source_path, header=True, inferSchema=False).select("*", "_metadata.file_name").withColumns({"date1": lit(currentdt), "GUID": lit(guid)})
      
      # Add a derived column to handle empty cells
      df = df.withColumn("new_column", when(col("your_column") == "", None).otherwise(col("your_column")))
      

    Adjust the column names ("your_column") according to your actual dataset.

  2. Additional Considerations:

    • Make sure to replace "your_column" with the actual column name you want to handle.
    • If you encounter any other issues or need further assistance, feel free to ask!

Remember that handling NULLs and empty cells correctly is crucial for data integrity and accurate analysis. Hopefully, one of these approaches will resolve your issue! 🚀12.