02-22-2016 08:34 AM
Pyspark 1.6: DataFrame: Converting one column from string to float/double
I have two columns in a DataFrame, both of which are loaded as strings.
DF = rawdata.select('house name', 'price')
I want to convert DF.price to float.
DF = rawdata.select('house name', float('price')) #did not work
DF[DF.price = float(DF.price)) # did not work
DF.price = DF.price.astype(float) # pandas-like script did not work
Could you please help me convert it within the DataFrame?
I know how to convert it in the RDD: DF.map(lambda x: float(x.price))
But I am trying to do all the conversion in the DataFrame.
Note: My platform does not have the same interface as the Databricks platform, in which you can change the column type while loading the file.
03-11-2016 10:03 AM
The cast function (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast) can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:
from pyspark.sql.functions import col
df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))
02-24-2016 09:55 PM
You can use a UDF to do that. Unfortunately, there's no builtin for this type conversion.
sqlContext.udf.register("float",lambda x:float(x))
from pyspark.sql.functions import expr
DF = rawdata.select('house name', expr(float('price')))
02-27-2016 03:28 PM
I fixed it as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def string_to_float(x):
    return float(x)
udfstring_to_float = udf(string_to_float, StringType())
rawdata.withColumn("name",udfstring_to_float("numberfloat") )
Out[8]: DataFrame[name: string, number_int: int, numberfloat: double]
02-28-2016 10:36 PM
Thanks for the suggestion. Sorry though, it did not work.
from pyspark.sql.functions import udf
sqlContext.udf.register("float",lambda x:float(x))
from pyspark.sql.functions import expr
DF = rawdata.select('name', expr(float('numberfloat')))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-243d7c9f050e> in <module>()
      4 from pyspark.sql.functions import expr
      5
----> 6 DF = rawdata.select('name', expr(float('numberfloat')))
ValueError: could not convert string to float: numberfloat
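For anyone landing here with the same traceback: the error comes from plain Python, not from Spark. Python evaluates float('numberfloat') first, which tries to parse the literal text "numberfloat" as a number, so expr is never reached. expr expects the SQL text itself as a string. The sketch below reproduces the failure without Spark and shows the kind of string expr would need. The variable names are only illustrative.

```python
# Python evaluates the argument before calling expr(), so this fails
# on its own, with no DataFrame involved.
try:
    float('numberfloat')
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# What expr() needs is the SQL expression as text, evaluated later by Spark.
sql_text = "CAST(numberfloat AS FLOAT)"

# With a DataFrame at hand, the call would look like this:
# rawdata.select('name', expr(sql_text).alias('numberfloat'))
```

Because CAST is part of Spark SQL, this form should not need a registered UDF named "float".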
03-11-2016 10:03 AM
The
cast
function can convert the specified columns into different dataTypes. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast
df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))
01-11-2017 08:31 AM
Slightly simpler:
df_num = df.select(df.employment.cast("float"),
                   df.education.cast("float"), df.health.cast("float"))
This works with multiple columns, three shown here.