09-12-2024 08:03 AM
Hi,
I'm testing the latest version of the Databricks Runtime, but I'm getting errors from a simple dropDuplicates call.
Using the following code:
data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()
I'm getting this error:
TypeError Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'
It works fine if I just pass the list positionally instead of as a keyword argument.
It looks like they changed the function definition to accept varargs instead of a list, but this broke a lot of code for us.
Is anybody else seeing this problem?
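A minimal sketch of what seems to have happened (the two signatures below are assumptions for illustration, not the actual PySpark source): when a parameter that used to be a named argument becomes varargs, any keyword call against the old name raises a TypeError, while the positional call still works.

```python
def drop_duplicates_old(subset=None):
    """Old-style signature: accepts subset as a keyword argument."""
    return subset

def drop_duplicates_new(*subset):
    """New-style varargs signature: positional arguments only."""
    return list(subset) or None

# The keyword call works against the old signature...
assert drop_duplicates_old(subset=['SOME_COLUMN']) == ['SOME_COLUMN']

# ...but raises TypeError against the varargs signature,
# matching the error in the traceback above.
try:
    drop_duplicates_new(subset=['SOME_COLUMN'])
except TypeError as e:
    print(e)
```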
09-13-2024 07:08 AM
Wanted to add to this thread: I'm seeing the same issue. It appears to be a recent problem.
09-16-2024 07:07 AM
Same thing here, broke a lot of code.
09-16-2024 08:40 AM
What happens if you avoid passing it as a named parameter? Like:
data.dropDuplicates(['SOME_COLUMN']).count()
09-16-2024 09:30 AM
Hi, as I said, that works, but it broke a really big codebase.
Databricks should not be changing the public API of a function in a "stable" release.
09-16-2024 09:42 AM
Exactly what @juan_barreto said. The public API should be a contract that we can trust, and it shouldn't be changed lightly. Imagine a codebase with hundreds of notebooks where the developer team agreed on a convention of using keyword arguments for that particular function. Now you have a problem. The fix is simple, but you have to rewrite your whole codebase.
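One way to avoid rewriting hundreds of notebooks is a compatibility shim that translates the old subset= keyword call into positional arguments. This is a hypothetical sketch against a stand-in class; in a real codebase you would wrap pyspark.sql.DataFrame.dropDuplicates, and you should verify the behavior against your runtime before relying on it.

```python
import functools

def accept_subset_keyword(method):
    """Hypothetical shim: let a varargs-only method accept subset= again."""
    @functools.wraps(method)
    def wrapper(self, *cols, subset=None):
        if subset is not None:
            cols = tuple(subset)  # translate subset= into positional args
        return method(self, *cols)
    return wrapper

class FakeDataFrame:
    """Stand-in for pyspark.sql.DataFrame, just to demonstrate the shim."""
    @accept_subset_keyword
    def dropDuplicates(self, *cols):  # varargs-only, like the new API
        return list(cols)

df = FakeDataFrame()
assert df.dropDuplicates('a', 'b') == ['a', 'b']           # new style
assert df.dropDuplicates(subset=['a', 'b']) == ['a', 'b']  # old style restored
```

Monkey-patching a library class is a stopgap, of course; migrating the call sites to positional arguments is the durable fix.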
09-16-2024 11:00 PM
If it had been communicated as a breaking change between major updates, it would be OK. But I can't find anything in the release notes, so it looks like a bug.