Running a K-means (.fit) gives error: "Params must be either a param map or a list/tuple of param maps but got %s." % type(params)

I am running a k-means algorithm. My features are DoubleType and contain no nulls, but I get: raise TypeError("Params must be either a param map or a list/tuple of param maps but got %s." % type(params)). Does anyone have any idea how to solve this?

File /databricks/spark/python/pyspark/ml/, in, dataset, params)
    203 return self.copy(params)._fit(dataset)
    204 else:
--> 205 return self._fit(dataset)
    206 else:
    207 raise TypeError(
    208 "Params must be either a param map or a list/tuple of param maps, "
    209 "but got %s." % type(params)
    210 )

File /databricks/spark/python/pyspark/ml/, in JavaEstimator._fit(self, dataset)
    380 def _fit(self, dataset: DataFrame) -> JM:
--> 381 java_model = self._fit_java(dataset)
    382 model = self._create_model(java_model)
    383 return self._copyValues(model)

File /databricks/spark/python/pyspark/ml/, in JavaEstimator._fit_java(self, dataset)
    375 assert self._java_obj is not None
    377 self._transfer_params_to_java()
--> 378 return

File /databricks/spark/python/lib/, in JavaMember.__call__(self, *args)
    1316 command = proto.CALL_COMMAND_NAME +\
    1317 self.command_header +\
    1318 args_command +\
    1319 proto.END_COMMAND_PART
    1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
    1323 answer, self.gateway_client, self.target_id,
    1325 for temp_arg in temp_args:
    1326 if hasattr(temp_arg, "_detach"):

File /databricks/spark/python/pyspark/errors/exceptions/, in capture_sql_exception.<locals>.deco(*a, **kw)
    160 def deco(*a: Any, **kw: Any) -> Any:
    161 try:
--> 162 return f(*a, **kw)
    163 except Py4JJavaError as e:
    164 converted = convert_exception(e.java_exception)

File /databricks/spark/python/lib/, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
    327 "An error occurred while calling {0}{1}{2}.\n".
    328 format(target_id, ".", name), value)
    329 else:
    330 raise Py4JError(
    331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332 format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 4052.0 failed 4 times, most recent failure: Lost task 9.3 in stage 4052.0 (TID 29072) ( executor 0): java.lang.AssertionError: assertion failed at...
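A note on reading this traceback: the "Params must be either a param map..." text shows up only because the traceback prints the source lines surrounding each call. The arrow at line 205 points at `self._fit(dataset)`, so the TypeError branch was never reached, and the exception actually raised is the Py4JJavaError at the bottom. A simplified sketch of that dispatch, where `SketchEstimator` is a hypothetical stand-in and not PySpark's own class:

```python
class SketchEstimator:
    """Toy stand-in that mimics how fit() dispatches on `params`."""

    def _fit(self, dataset):
        # The real estimator trains a model here; this sketch returns a label.
        return "model fitted on %r" % (dataset,)

    def copy(self, params):
        # The real method returns a copy with the extra params applied.
        return self

    def fit(self, dataset, params=None):
        if params is None:
            params = {}
        if isinstance(params, (list, tuple)):
            # One model per param map.
            return [self.copy(p)._fit(dataset) for p in params]
        elif isinstance(params, dict):
            if params:
                return self.copy(params)._fit(dataset)
            else:
                # Branch marked "--> 205" in the traceback: no params passed.
                return self._fit(dataset)
        else:
            # Reached only when `params` has an unsupported type.
            raise TypeError(
                "Params must be either a param map or a list/tuple of param maps, "
                "but got %s." % type(params)
            )


estimator = SketchEstimator()
result = estimator.fit("my_dataset")  # takes the no-params branch, no TypeError
```

Calling `fit` with no `params`, as in the question, never reaches the `raise TypeError` line; only something like `estimator.fit("my_dataset", params=42)` would.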


I found the answer just by trying several things, although I do not understand exactly what the problem was. All I had to do was to cache the input data before fitting the model:

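from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
assemble = VectorAssembler(inputCols=columns_input, outputCol='features')
assembled_data = assemble.transform(df).cache()  # df: placeholder name for the input DataFrame; cache before fitting
KMeans_algo = KMeans(featuresCol='features', k=number_of_clusters)
model = KMeans_algo.fit(assembled_data)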
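For anyone who wants the workaround as a reusable step, here is a minimal sketch of the same cache-then-fit pattern. The function name `fit_kmeans_on_cached` and the parameter `df` are hypothetical, not from the original post. Since `cache()` is lazy, the sketch also runs `count()` to materialize the cached data before fitting; that extra action is my addition and not something the fix above is known to require.

```python
def fit_kmeans_on_cached(df, columns_input, number_of_clusters):
    """Assemble feature columns, cache the result, then fit KMeans on it.

    `df` is assumed to be a Spark DataFrame whose `columns_input` columns
    are numeric and contain no nulls, as described in the question.
    """
    # Imported inside the function so the sketch can be loaded even where
    # pyspark is not installed; in a notebook these would sit at the top.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=columns_input, outputCol="features")
    assembled = assembler.transform(df).cache()

    # cache() only marks the DataFrame for caching; an action such as
    # count() forces Spark to compute and store it.
    assembled.count()

    kmeans = KMeans(featuresCol="features", k=number_of_clusters)
    model = kmeans.fit(assembled)

    # Return the cached DataFrame as well so the caller can unpersist it.
    return model, assembled
```

Typical use would be `model, assembled = fit_kmeans_on_cached(df, columns_input, number_of_clusters)`, followed by `assembled.unpersist()` once the cached data is no longer needed.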

