Couldn't convert string to float when fitting a model

enri_casca
New Contributor III

Hi, I am very new to Databricks and I am trying to run quick experiments to work out best practices for me, my colleagues, and the company.

I pull the data from Snowflake:

df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option('query', query) \
  .load()

I check the data types of the features with printSchema().

I convert to pandas with

df.to_pandas_on_spark()

and here I have the FIRST PROBLEM: all the columns become 'object' type.
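As a minimal pandas-only sketch of this first problem (the column names here are hypothetical, not the poster's real schema), object columns can be cast back explicitly:

```python
import pandas as pd

# Hypothetical frame: everything arrives as object, as described above
df = pd.DataFrame({"col_float": ["1.5", "2.0"], "label": ["0", "1"]}, dtype=object)
print(df.dtypes)   # both columns report object

# Explicit casts restore usable numeric dtypes
df["col_float"] = df["col_float"].astype(float)
df["label"] = df["label"].astype("int64")
print(df.dtypes)   # col_float float64, label int64
```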

I convert the columns to float/int

and then I run a simple RandomForest classifier:

from sklearn.ensemble import RandomForestClassifier as srf

model = srf()
X = df[['col_float']]
y = df['label']
model.fit(X, y)

and here I have the SECOND PROBLEM: I keep receiving this error:

ValueError: could not convert string to float: 'col_float'

I have been looking at different tutorials and trying different things. I think it might be something silly because I am new to Databricks, but I am wasting so much time.

Has anyone had the same issue, or does anyone know what is happening?


-werners-
Esteemed Contributor III

Can you check this SO topic?

enri_casca
New Contributor III

Hi, thanks for replying. I did check, but nothing changed.

I still have both problems: when I convert to pandas everything is still an object,

and then I convert the columns but I still get that ValueError.

-werners-
Esteemed Contributor III

Can you check what types the df has before converting it to pandas?

Then check here how this would translate in pandas.

It is a pyspark.sql.dataframe.DataFrame.

To convert to pandas I have tried

df.to_pandas_on_spark()

df.toPandas()

and

import pyspark.pandas as ps

ps.DataFrame(df)

All of them give the same result, with everything becoming an object.

But at the same time, why do I get the error that it can't convert string to float even after I convert the columns to float?

-werners-
Esteemed Contributor III

Clearly the conversion is not what you expect.

What I mean is: can you check the schema of the (PySpark) dataframe and see what column types it has?

Because depending on this, pandas will either cast them or put them into object type.

The schema of the Spark dataframe is perfectly fine, with the features having different types (date, string, decimal).

-werners-
Esteemed Contributor III

date translates to object,

string translates to object,

decimal translates to object

(see the link I posted).

This is normal behavior. You should convert the object columns in pandas:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
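A small sketch of why decimal columns land in object dtype: toPandas() returns Spark decimals as Python Decimal objects, which pandas cannot store in a numeric dtype. convert_dtypes() may still leave them as object, so an explicit cast is the predictable route (the column name here is illustrative):

```python
import pandas as pd
from decimal import Decimal

# What a decimal(10,2) column typically looks like after toPandas():
# Python Decimal objects, so pandas reports dtype 'object'
df = pd.DataFrame({"col_float": [Decimal("1.5"), Decimal("2.0")]})
print(df["col_float"].dtype)   # object

# Explicit cast to a numeric dtype
df["col_float"] = df["col_float"].astype(float)
print(df["col_float"].dtype)   # float64
```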

Ok, understood the transformation to pandas, thank you :).

But since I had everything in object format, I always converted all the columns to the correct type using astype(format),

so when I run df.dtypes I see the correct types,

but still, when I try to fit a model it gives me ValueError: could not convert string to float: 'name of the first feature'.

-werners-
Esteemed Contributor III

This is the weird thing: the column has already been converted to float, and you can see that when you call dtypes. So if I try one of the string methods to check for commas or anything else, it says

"Cannot call StringMethods on type FloatType",

but I get the same error when I try to fit the model. To make it easy, I am trying to fit a model with only 1 feature.

It seems to me that the error is about the name of the column, as if it were trying to fit the column name itself. Usually the ValueError gives you the string/value that cannot be converted to float, and in this case it gives me the name of the column.
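That theory can be reproduced: if the column name ends up inside the data (for example via a mis-read header), sklearn's float coercion fails with exactly this shape of error. This is a hypothetical repro, not the poster's actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical: the header string has leaked into the feature values
X = np.array([["col_float"], ["1.5"], ["2.0"]], dtype=object)
y = [0, 1, 0]

msg = ""
try:
    RandomForestClassifier(n_estimators=5).fit(X, y)
except ValueError as err:
    msg = str(err)
print(msg)  # mentions: could not convert string to float
```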

I can add that if I convert the data types in Spark:

if I use toPandas() --> then it works

if I use to_pandas_on_spark() --> same error

Dan_Z
Honored Contributor

Did you figure this one out, @Enrico Cascavilla?

Hi @Enrico Cascavilla,

Just a friendly follow-up. Were you able to find the solution, or are you still looking for help? If you did find the solution, please mark it as best.
