
Couldn't convert string to float when fitting a model

enri_casca
New Contributor III

Hi, I am very new to Databricks and I am trying to run quick experiments to understand the best practices for me, my colleagues and the company.

I pull the data from Snowflake:

df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option('query', query) \
  .load()

I check the data types of the features with printSchema(), then convert to pandas with

df.to_pandas_on_spark()

and I have the FIRST PROBLEM: all the columns become 'object' type.

I convert the columns to float/int
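For illustration, that conversion step might look roughly like this (just a sketch; col_float and label are the columns used in the code below):

df = df.to_pandas_on_spark()

# cast the object columns to numeric types explicitly
df['col_float'] = df['col_float'].astype(float)
df['label'] = df['label'].astype(int)

df.dtypes  # verify the resulting dtypes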

and I run a simple RandomForest classifier

from sklearn.ensemble import RandomForestClassifier as srf

model = srf()
X = df[['col_float']]
y = df['label']
model.fit(X, y)

and here I have the SECOND PROBLEM: I keep receiving this error

ValueError: could not convert string to float: 'col_float'

I have been looking at different tutorials and trying different things. I think it might be something silly because I am new to Databricks, but I am wasting so much time.

Has anyone had the same issue, or does anyone know what is happening?


-werners-
Esteemed Contributor III

can you check this SO topic?

enri_casca
New Contributor III

Hi, thanks for replying. I did check, but nothing changed.

I still have both problems: when I convert to pandas everything is still an object,

and then I convert the columns but I still get that ValueError.

-werners-
Esteemed Contributor III

Can you check what types the df has before converting it to pandas?

Then check here how this would translate in pandas.
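For example (a quick sketch):

df.printSchema()  # Spark-side column types
df.dtypes         # the same information as a list of (column, type) pairs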

It is a pyspark.sql.dataframe.DataFrame.

To convert to pandas I have tried

df.to_pandas_on_spark()

df.toPandas()

and

import pyspark.pandas as ps
ps.DataFrame(df)

All of them give the same result, with everything becoming an object.

But at the same time, why do I also get the error that it can't convert string to float after I convert the columns to float?

-werners-
Esteemed Contributor III

Clearly the conversion is not what you expect.

What I mean is: can you check the schema of the (PySpark) dataframe and see what column types it has?

Because depending on this, pandas will either cast them or put them into the object type.

The schema of the Spark dataframe is perfectly fine, with the features having different types (date, string, decimal).

-werners-
Esteemed Contributor III

date translates to object,

string translates to object,

decimal translates to object

(see link I posted)

This is normal behavior.

You should convert the object types in pandas:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
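For example, roughly like this (a sketch only; the column name is an assumption, and depending on the data an explicit astype may still be needed):

pdf = df.toPandas()          # date/string/decimal columns arrive as object
pdf = pdf.convert_dtypes()   # let pandas infer better dtypes where it can
pdf['col_float'] = pdf['col_float'].astype(float)  # or cast a specific column explicitly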

Ok, understood the transformation to pandas, thank you :).

But since I had everything in object format, I always converted all the columns to the correct type using astype(),

so when I run df.dtypes I see the correct format

but still when I try to fit a model it gives me the ValueError: could not convert string to float: 'name of the first feature'

-werners-
Esteemed Contributor III

This is the weird thing. The column has already been converted to float, and you can see that when you call dtypes. So if I try to use one of those methods to check for commas or anything else, it says

"Cannot call StringMethods on type FloatType"

but I get the same error when I try to fit the model. To make it easy, I am trying to fit a model with only 1 feature.

To me it seems the error is about the name of the column, as if it is trying to fit the name of the column. Usually the ValueError should give you the string/value that it cannot convert to float, and in this case it gives me the name of the column.

I can add that if I convert the data types in Spark:

if I use toPandas() --> then it works

if I use to_pandas_on_spark() --> same error
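For reference, the path that works might look roughly like this (a sketch with an assumed column and cast, not the exact code from the thread):

from pyspark.sql import functions as F
from sklearn.ensemble import RandomForestClassifier

# cast on the Spark side, then convert to plain pandas before handing the data to sklearn
df_cast = df.withColumn('col_float', F.col('col_float').cast('double'))
pdf = df_cast.toPandas()

model = RandomForestClassifier()
model.fit(pdf[['col_float']], pdf['label'])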

Dan_Z
Databricks Employee

Did you figure this one out, @Enrico Cascavilla?

Hi @Enrico Cascavilla,

Just a friendly follow-up. Were you able to find the solution, or are you still looking for help? If you did find the solution, please mark it as best.
