03-01-2022 03:50 AM
Hi, I am very new to Databricks and I am trying to run quick experiments to work out the best practices for me, my colleagues, and the company.
I pull the data from Snowflake:
# read the query results from Snowflake into a Spark DataFrame
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option('query', query) \
    .load()
I check the data types of the features with printSchema(),
then convert to pandas with
df.to_pandas_on_spark()
and here I have the FIRST PROBLEM: all the columns become 'object' type.
I convert the columns to float/int
and run a simple RandomForest classifier:
from sklearn.ensemble import RandomForestClassifier as srf
model = srf()
X = df[['col_float']]
y = df['label']
model.fit(X, y)
and here I have the SECOND PROBLEM: I keep receiving this error
ValueError: could not convert string to float: 'col_float'
I have been looking at different tutorials and trying different things. I think it might be something silly because I am new to Databricks, but I am wasting so much time.
Has anyone had the same issue, or does anyone know what is happening?
03-01-2022 03:57 AM
03-01-2022 05:02 AM
Hi, thanks for replying. I did check, but nothing changed.
I still have both problems: when I convert to pandas everything is still an object,
and after I convert the columns I still get that ValueError.
03-01-2022 06:09 AM
Can you check what types the df has before converting it to pandas?
Then check here how this would translate in pandas.
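For example, a minimal sketch (assuming df is the Spark DataFrame from the original post):
# Spark-side column types
df.printSchema()
# compare with the dtypes after converting to pandas-on-Spark
pdf = df.to_pandas_on_spark()
print(pdf.dtypes)  # date/string/decimal columns typically show up as 'object'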
03-01-2022 07:27 AM
It is a pyspark.sql.dataframe.DataFrame.
To convert to pandas I have tried
df.to_pandas_on_spark()
df.toPandas()
and
import pyspark.pandas as ps
ps.DataFrame(df)
All of them give the same result, with everything becoming an object.
But at the same time, why do I still get the error that a string can't be converted to float even after I convert the columns to float?
03-01-2022 07:30 AM
Clearly the conversion is not doing what you expect.
What I mean is: can you check the schema of the (PySpark) dataframe and see what column types it has?
Depending on those types, pandas will either cast them or put them into object type.
03-01-2022 08:26 AM
The schema of the Spark dataframe is perfectly fine, with all the features having their expected types (date, string, decimal).
03-01-2022 08:33 AM
date translates to object,
string translates to object,
decimal translates to object
(see the link I posted)
This is normal behavior.
You should convert the object types in pandas:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
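A minimal sketch of that suggestion (assuming pdf is a plain pandas DataFrame, e.g. from toPandas()):
# let pandas infer better dtypes for the 'object' columns
pdf = pdf.convert_dtypes()
print(pdf.dtypes)
# or cast specific columns explicitly
pdf['col_float'] = pdf['col_float'].astype(float)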
03-01-2022 09:06 AM
Ok, I understand the transformation to pandas now, thank you :).
But since everything was in object format, I always converted all the columns to the correct type using astype(format),
so when I run df.dtypes I see the correct types,
but still, when I try to fit a model, it gives me ValueError: could not convert string to float: 'name of the first feature'
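For reference, the failing pattern described above would look roughly like this (a sketch; column names are placeholders):
psdf = df.to_pandas_on_spark()
psdf['col_float'] = psdf['col_float'].astype(float)
print(psdf.dtypes)  # col_float now shows as float64
model.fit(psdf[['col_float']], psdf['label'])
# still raises: ValueError: could not convert string to float: 'col_float'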
03-02-2022 12:15 AM
Could it be the commas and thousands separators?
https://stackoverflow.com/questions/39125665/cannot-convert-string-to-float-in-pandas-valueerror
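Something along the lines of the linked answer (col_float is a placeholder name):
# strip thousands separators before casting, e.g. '1,234.5' -> 1234.5
pdf['col_float'] = pdf['col_float'].str.replace(',', '').astype(float)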
03-02-2022 01:33 AM
This is the weird thing: the column has already been converted to float, and you can see that when you call dtypes, so if I try one of those methods to check for commas or anything else it says
"Cannot call StringMethods on type FloatType"
but I get the same error when I try to fit the model. To make it easy, I am trying to fit a model with only one feature.
To me it seems the error is about the name of the column, as if it were trying to fit the column name itself. Usually the ValueError prints the string/value that cannot be converted to float, and in this case it gives me the name of the column.
03-02-2022 05:04 AM
I can add that if I convert the data types in Spark:
if I use toPandas() --> then it works
if I use to_pandas_on_spark() --> same error
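In other words, the working path would look roughly like this (a sketch; column names are placeholders):
from pyspark.sql.functions import col
from sklearn.ensemble import RandomForestClassifier

# cast on the Spark side, then convert to plain pandas
pdf = (df
       .withColumn('col_float', col('col_float').cast('double'))
       .withColumn('label', col('label').cast('int'))
       .toPandas())

model = RandomForestClassifier()
model.fit(pdf[['col_float']], pdf['label'])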
05-04-2022 09:16 AM
Did you figure this one out, @Enrico Cascavilla?
06-07-2022 09:23 AM
Hi @Enrico Cascavilla,
Just a friendly follow-up. Were you able to find a solution, or are you still looking for help? If you did find the solution, please mark it as best.