02-22-2016 08:34 AM
Pyspark 1.6: DataFrame: Converting one column from string to float/double
I have two columns in a dataframe both of which are loaded as string.
DF = rawdata.select('house name', 'price')
I want to convert DF.price to float.
DF = rawdata.select('house name', float('price')) #did not work
DF[DF.price = float(DF.price)) # did not work
DF.price = DF.price.astype(float) # Panda like script did not work
Could you please help me convert it within the DataFrame?
I know how to convert in the RDD: DF.map(lambda x: float(x.price))
But, I am trying to do all the conversion in the Dataframe.
Note: My platform does not have the same interface as the Databricks platform, where you can change the column type while loading the file.
Labels: Conversion, Dataframe, Pyspark
Accepted Solutions
03-11-2016 10:03 AM
The cast function can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast
from pyspark.sql.functions import col
df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))
02-24-2016 09:55 PM
You can use a UDF to do that. Unfortunately, there's no builtin for this type conversion.
sqlContext.udf.register("float",lambda x:float(x))
from pyspark.sql.functions import expr
DF = rawdata.select('house name', expr(float('price')))
02-27-2016 03:28 PM
I fixed it as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def string_to_float(x):
    return float(x)
udfstring_to_float = udf(string_to_float, StringType())
rawdata.withColumn("name",udfstring_to_float("numberfloat") )
Out[8]: DataFrame[name: string, number_int: int, numberfloat: double]
02-28-2016 10:36 PM
Thanks for the suggestion. Sorry though, it did not work.
from pyspark.sql.functions import udf
sqlContext.udf.register("float",lambda x:float(x))
from pyspark.sql.functions import expr
DF = rawdata.select('name', expr(float('numberfloat')))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-243d7c9f050e> in <module>()
      4 from pyspark.sql.functions import expr
      5
----> 6 DF = rawdata.select('name', expr(float('numberfloat')))
ValueError: could not convert string to float: numberfloat
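The error comes from plain Python, not from Spark: float('numberfloat') is evaluated on the literal column name before expr() is ever called. The sketch below reproduces that without Spark. try_float is a hypothetical helper written only for this illustration.

```python
def try_float(text):
    """Return (value, None) on success or (None, error message) on failure."""
    try:
        return float(text), None
    except ValueError as exc:
        return None, str(exc)

# The column *name* is just a string to Python, so conversion fails.
name_result = try_float("numberfloat")

# An actual numeric string converts fine.
value_result = try_float("3.5")

print(name_result)
print(value_result)
```

So the conversion has to be expressed as a column operation that Spark applies to each row, instead of calling Python's float on the column name.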
01-11-2017 08:31 AM
Slightly simpler:
df_num = df.select(df.employment.cast("float"),
                   df.education.cast("float"),
                   df.health.cast("float"))
This works with multiple columns, three shown here.