Problem with dropDuplicates in Databricks Runtime 15.4 LTS

juan_barreto
New Contributor III

Hi,
I'm testing the latest version of the Databricks Runtime, but I'm getting errors from a simple dropDuplicates call.

Using the following code:

data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()

I get this error:

TypeError                                 Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()

File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.perf_counter()
     46 try:
---> 47     res = func(*args, **kwargs)
     48     logger.log_success(
     49         module_name, class_name, function_name, time.perf_counter() - start, signature
     50     )
     51     return res

TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'

It works fine if I pass the list positionally instead of as a keyword argument.

It looks like the function definition was changed to take varargs instead of a list, and that broke a lot of code for us.
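
To illustrate why a varargs signature would reject the keyword, here is a minimal pure-Python sketch. The two functions are hypothetical stand-ins written for this example, not the actual PySpark or Databricks source; they only mimic the two signature styles discussed above.

```python
# Hypothetical stand-ins, NOT the real PySpark implementations.
# They only mimic the two signature styles to show how Python binds arguments.

def drop_duplicates_list(subset=None):
    """List-style signature: a single optional parameter named `subset`."""
    return list(subset) if subset is not None else []


def drop_duplicates_varargs(*subset):
    """Varargs-style signature: `subset` only collects positional arguments."""
    # Accept both f(["a", "b"]) and f("a", "b").
    if len(subset) == 1 and isinstance(subset[0], (list, tuple)):
        return list(subset[0])
    return list(subset)


def keyword_call_error(func):
    """Return the TypeError message from func(subset=[...]), or None if the call works."""
    try:
        func(subset=["SOME_COLUMN"])
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(keyword_call_error(drop_duplicates_list))
    print(keyword_call_error(drop_duplicates_varargs))
    print(drop_duplicates_varargs(["SOME_COLUMN"]))
```

With a `*subset` parameter there is no keyword named `subset` to bind to, so the keyword call fails while the positional call still goes through, which would match the behaviour described in this post.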

Does somebody else have this problem?

6 REPLIES

kellys
New Contributor II

Wanted to add to this thread. Seeing the same issue. This appears to be a recent problem.

RodriGonca
New Contributor II

Same thing here, broke a lot of code.

Witold
Contributor III

What happens if you avoid passing it as a named parameter? Like:

data.dropDuplicates(['SOME_COLUMN']).count()

juan_barreto
New Contributor III

Hi, as I said, doing that works. But the change broke a really big codebase.

Databricks should not be changing the public API of a function in a "stable" release.

Exactly what @juan_barreto said. The public API should be a contract that we can trust, and it shouldn't be changed lightly. Imagine a codebase with hundreds of notebooks, where the development team agreed on a convention of using keyword arguments for that particular function. Now you have a problem. The fix is simple, but you need to rewrite your whole codebase.
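
One possible stopgap for a large codebase, sketched under assumptions: route calls through a small helper that always passes the columns positionally, since the positional form is the one reported to work in this thread. The helper name `drop_duplicates_compat` is hypothetical, and the only DataFrame method it relies on is `dropDuplicates`.

```python
def drop_duplicates_compat(df, subset=None):
    """Call df.dropDuplicates with the columns passed positionally.

    Hypothetical helper for illustration. It avoids the `subset=` keyword,
    so it does not depend on how the parameter is named in the runtime.
    """
    if subset is None:
        # No columns given: deduplicate on all columns.
        return df.dropDuplicates()
    if isinstance(subset, str):
        # Convenience (an assumption of this sketch): allow a single column name.
        subset = [subset]
    return df.dropDuplicates(list(subset))
```

Usage would look like `drop_duplicates_compat(data, subset=['SOME_COLUMN']).count()`. Call sites still have to be edited once, but the keyword-style convention survives in the team's own helper rather than in the library call.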

Witold
Contributor III

If it had been communicated as a breaking change between major versions, it would be OK. But I can't find anything in the release notes, so it's a bug.
