Problem with dropDuplicates in Databricks runtime 15.4LTS
09-12-2024 08:03 AM
Hi,
I'm testing the latest Databricks Runtime, but I'm getting errors from a simple dropDuplicates. Using the following code:
data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()
I get this error:
TypeError Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'
It works fine if I pass the list positionally instead of as a keyword argument.
It looks like the function definition was changed to take varargs instead of a list, but this broke a lot of code for us.
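The failure mode can be reproduced in plain Python without Spark. The two function definitions below are hypothetical stand-ins for the suspected signature change, not the actual PySpark source:

```python
# Hypothetical stand-ins for the suspected signature change (not real PySpark code).

def drop_duplicates_old(subset=None):
    # old style: an optional 'subset' keyword taking a list of column names
    return subset

def drop_duplicates_new(*cols):
    # new style: variadic positional column arguments
    return list(cols)

print(drop_duplicates_old(subset=['SOME_COLUMN']))  # accepted
print(drop_duplicates_new(['SOME_COLUMN']))         # accepted positionally
try:
    drop_duplicates_new(subset=['SOME_COLUMN'])     # rejected
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'subset'
```

A variadic `*cols` parameter can never be bound by the name `subset`, which is exactly the TypeError in the traceback above.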
Does somebody else have this problem?
09-13-2024 07:08 AM
Wanted to add to this thread: I'm seeing the same issue. It appears to be a recent problem.
09-16-2024 07:07 AM
Same thing here; it broke a lot of code.
09-16-2024 08:40 AM
What happens if you avoid passing it as a named parameter? Like:
data.dropDuplicates(['SOME_COLUMN']).count()
09-16-2024 09:30 AM
Hi, as I said, that works, but it broke a really big codebase.
Databricks should not change the public API of a function in a "stable" release.
09-16-2024 09:42 AM
Exactly what @juan_barreto said. The public API should be a contract we can trust, and it shouldn't be changed lightly. Imagine a codebase with hundreds of notebooks where the developer team agreed on a convention of using keyword arguments for that particular function. Now you have a problem: the fix is simple, but you have to rewrite your whole codebase.
09-16-2024 11:00 PM
If it had been communicated as a breaking change between major updates, it would be OK. But I can't find anything in the release notes, so it's a bug.

