Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

BUG - withColumns in PySpark doesn't handle empty dictionary

Dhruv-22
Contributor II

Today, while reading a delta load, my notebook failed, so I want to report a bug: withColumns does not tolerate an empty dictionary and raises the following error in PySpark.

from collections import namedtuple
from pyspark.sql.functions import col

# extract_coordinates_udf is a user-defined UDF defined elsewhere in the notebook
flat_tuple = namedtuple("flat_tuple", ["old_col", "new_col", "logic"])

# flat_tuple(old_col, new_col, logic)
flat_tuples = [
    flat_tuple("Coordinates", "Coordinates", extract_coordinates_udf(col("Coordinates")["coordinates"])),
    flat_tuple("CreatedById", "CreatedById", col("CreatedById")["$oid"]),
    flat_tuple("CreationDate", "CreationDate", col("CreationDate")["$date"]["$numberLong"]),
    flat_tuple("Names", "Names", col("Names")[0]["LanguageValue"]),
    flat_tuple("Location", "LocationCoordinates", extract_coordinates_udf(col("Location")["coordinates"])),
    flat_tuple("Location", "LocationType", col("Location")["type"]),
    flat_tuple("_id", "sectorId", col("_id")["$oid"]),
]

# If none of the source columns exist in df, this dict comes out empty
final_flat_cols = {tup.new_col: tup.logic for tup in flat_tuples if tup.old_col in df.columns}
df = df.withColumns(final_flat_cols)

-- Output
AssertionError: [Trace ID: 00-68d8e7cacb471da60efe65d0ef17703d-a3b270f251715df4-00]

This case is handled in standard (open-source) PySpark, and I don't want to write a special if-else check against the DataFrame's columns before every withColumns call. It would be great if it were handled internally.
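For what it's worth, the empty-dict situation can be reproduced without Spark at all. This minimal sketch uses plain strings as hypothetical stand-ins for the Column expressions, just to show how the comprehension yields an empty dict when none of the source columns arrived in the load:

```python
from collections import namedtuple

flat_tuple = namedtuple("flat_tuple", ["old_col", "new_col", "logic"])

# Stand-in logic strings instead of real Column expressions
flat_tuples = [
    flat_tuple("Coordinates", "Coordinates", "coords_logic"),
    flat_tuple("_id", "sectorId", "oid_logic"),
]

# Simulate a load where none of the expected columns are present
df_columns = ["UnrelatedColumn"]

final_flat_cols = {t.new_col: t.logic for t in flat_tuples if t.old_col in df_columns}
print(final_flat_cols)  # -> {}  (this empty dict is what withColumns then rejects)
```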

 

Currently, I'm using the following workaround:

flat_col_lst = [tup.logic.alias(tup.new_col) for tup in flat_tuples if tup.old_col in df.columns]
df = df.select('*', *flat_col_lst)
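(The select workaround works because unpacking an empty list leaves a plain df.select('*'), which is valid.) If anyone prefers an explicit guard instead, a one-line truthiness check avoids the call entirely. This is a sketch with a stub DataFrame class, only to illustrate the control flow; in the notebook, df would be the real DataFrame:

```python
class StubDF:
    """Minimal stand-in for a DataFrame, only to illustrate the guard."""
    def withColumns(self, cols):
        assert cols, "withColumns never sees an empty dict behind the guard"
        return self

df = StubDF()
final_flat_cols = {}  # e.g. no matching columns in this load

if final_flat_cols:  # an empty dict is falsy, so the call is skipped
    df = df.withColumns(final_flat_cols)
```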

K_Anudeep
Databricks Employee

Hello @Dhruv-22 , 

I have tested this internally, and it appears to be a bug in the new serverless environment version 4.

[screenshot: K_Anudeep_0-1761907767534.png]

As a solution, you can try switching the version to 3 as shown below and re-running the above code; it should work.

[screenshot: K_Anudeep_1-1761907968909.png]

Anudeep

Hey @K_Anudeep 

I tried using Environment versions 3, 2, and 1, but still got the same error. Attached is a screenshot with version 3.

[screenshot: Dhruv22_0-1761912353654.png]

K_Anudeep
Databricks Employee

Hey @Dhruv-22 

Did you apply the version and start a new session (or clear the existing session) before running it? It should work on Env version 3, as shown in my repro below.

 

Anudeep

Yeah, I created a new session. I tried it 3-4 times.

K_Anudeep
Databricks Employee

Sure! Let me try once again and get back.

Anudeep

Hey @K_Anudeep, did you get anything?