Problem with dropDuplicates in Databricks Runtime 15.4 LTS
09-12-2024 08:03 AM
Hi,
I'm testing the latest version of the Databricks Runtime, but I'm getting an error on a simple dropDuplicates call.
Using the following code:
data = spark.read.table("some_table")
data.dropDuplicates(subset=['SOME_COLUMN']).count()
I'm getting this error:
TypeError Traceback (most recent call last)
File <command-934417477504931>, line 1
----> 1 data.dropDuplicates(subset=['SOME_COLUMN']).count()
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
TypeError: DataFrame.dropDuplicates() got an unexpected keyword argument 'subset'
It works fine if I only pass the list without using it as a keyword argument.
It looks like the function definition was changed to accept varargs instead of a list, which broke a lot of our code.
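For anyone hitting the same thing, here is a minimal sketch of a workaround, assuming the positional call keeps working as described above. The helper name drop_duplicates_compat is hypothetical (it is not part of PySpark); it only forwards the column list positionally so call sites never use the subset keyword.

```python
def drop_duplicates_compat(df, subset=None):
    """Hypothetical helper: call df.dropDuplicates without the `subset` keyword.

    Assumption: passing the column list as the first positional argument
    works on the affected runtime, as observed in this thread.
    """
    if subset is None:
        # No columns given: deduplicate on all columns.
        return df.dropDuplicates()
    if isinstance(subset, str):
        # Accept a single column name for convenience.
        subset = [subset]
    # Pass the list positionally instead of as subset=...
    return df.dropDuplicates(list(subset))
```

With this in place, the failing line would become drop_duplicates_compat(data, ['SOME_COLUMN']).count(). It does not fix the underlying signature change; it only keeps the keyword out of our own code.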
Is anyone else having this problem?