Problem with dropDuplicates in Databricks runtime 15.4LTS

juan_barreto · 09-12-2024

Hi,
I'm testing the latest version of the Databricks Runtime, but I'm getting an error from a simple dropDuplicates call.

Using the following code:

data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()

I'm getting this error:

TypeError                                 Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()

File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.perf_counter()
     46 try:
---> 47     res = func(*args, **kwargs)
     48     logger.log_success(
     49         module_name, class_name, function_name, time.perf_counter() - start, signature
     50     )
     51     return res

TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'

It works fine if I pass the list positionally instead of as a keyword argument, i.e. data.dropDuplicates(['SOME_COLUMN']).

It looks like the function definition was changed to accept varargs instead of a list, which broke a lot of our code.
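
As a stopgap, one option is to route every call through a small wrapper that never uses the subset keyword. This is only a sketch: drop_duplicates_compat is a hypothetical helper name, not part of PySpark, and it assumes that passing the column list positionally keeps working on this runtime, which is what I observed.

```python
from typing import Any, Optional, Sequence


def drop_duplicates_compat(df: Any, subset: Optional[Sequence[str]] = None) -> Any:
    """Hypothetical helper: call dropDuplicates without the `subset` keyword."""
    if subset is None:
        # No columns given: deduplicate on all columns.
        return df.dropDuplicates()
    # Pass the column list positionally; the keyword form is what raises
    # the TypeError shown above.
    return df.dropDuplicates(list(subset))


# Usage, assuming `spark` is an active SparkSession as in the code above:
# data = spark.read.table("some_table")
# drop_duplicates_compat(data, ["SOME_COLUMN"]).count()
```

With this in place, only the helper needs to change if the signature changes again, rather than every call site.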

Is anybody else having this problem?

Problem with dropDuplicates in Databricks runtime 15.4LTS