Problem with dropDuplicates in Databricks runtime 15.4LTS
09-12-2024 08:03 AM
Hi,
I'm testing the latest Databricks Runtime, but I'm getting errors from a simple dropDuplicates. Using the following code:
data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()
I get this error:
TypeError Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'
It works fine if I pass the list positionally instead of as a keyword argument.
It looks like the function definition was changed to take varargs instead of a list, but this broke a lot of code for us.
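The failure mode can be reproduced in plain Python without Spark. The two function definitions below are hypothetical stand-ins for the suspected signature change, not the actual PySpark source:

```python
# Hypothetical stand-ins for the suspected signature change (not real PySpark code).

def drop_duplicates_old(subset=None):
    # old style: an optional 'subset' keyword taking a list of column names
    return subset

def drop_duplicates_new(*cols):
    # new style: variadic positional column arguments
    return list(cols)

print(drop_duplicates_old(subset=['SOME_COLUMN']))  # accepted
print(drop_duplicates_new(['SOME_COLUMN']))         # accepted positionally
try:
    drop_duplicates_new(subset=['SOME_COLUMN'])     # rejected
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'subset'
```

A variadic `*cols` parameter can never be bound by the name `subset`, which is exactly the TypeError in the traceback above.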
Does somebody else have this problem?
09-13-2024 07:08 AM
Wanted to add to this thread: I'm seeing the same issue. It appears to be a recent problem.
09-16-2024 07:07 AM
Same thing here; it broke a lot of code.
09-16-2024 08:40 AM
What happens if you avoid passing it as a named parameter? Like:
data.dropDuplicates(['SOME_COLUMN']).count()
09-16-2024 09:30 AM
Hi, as I said, that works, but it broke a really big codebase.
Databricks should not change the public API of a function in a "stable" release.
09-16-2024 09:42 AM
Exactly what @juan_barreto said. The public API should be a contract we can trust, and it shouldn't be changed lightly. Imagine a codebase with hundreds of notebooks where the developer team agreed on a convention of using keyword arguments for that particular function. Now you have a problem: the fix is simple, but you have to rewrite your whole codebase.
09-16-2024 11:00 PM
If it had been communicated as a breaking change between major updates, it would be OK. But I can't find anything in the release notes, so it's a bug.

