
Pyspark DataFrame: Converting one column from string to float/double

SohelKhan
New Contributor II

Pyspark 1.6: DataFrame: Converting one column from string to float/double

I have two columns in a dataframe both of which are loaded as string.

DF = rawdata.select('house name', 'price')

I want to convert DF.price to float.

DF = rawdata.select('house name', float('price')) #did not work

DF[DF.price = float(DF.price)) # did not work

DF.price = DF.price.astype(float) # Panda like script did not work

Could you please help me convert it within the DataFrame?

I know how to convert it in the RDD: DF.map(lambda x: float(x.price))

But, I am trying to do all the conversion in the Dataframe.

Note: My platform does not have the same interface as the Databricks platform, where you can change the column type while loading the file.

1 ACCEPTED SOLUTION

raela
New Contributor III

The cast function can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast

from pyspark.sql.functions import col

df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))

5 REPLIES

zjffdu
New Contributor II

You can use a UDF to do that. Unfortunately, there's no built-in for this type conversion.

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('house name', expr(float('price')))

SohelKhan
New Contributor II

I fixed it as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def string_to_float(x):
  return float(x)

udfstring_to_float = udf(string_to_float, StringType())
rawdata.withColumn("name",udfstring_to_float("numberfloat") )

Out[8]: DataFrame[name: string, number_int: int, numberfloat: double]

SohelKhan
New Contributor II

Thanks for the suggestion. Sorry though, it did not work.

from pyspark.sql.functions import udf

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('name', expr(float('numberfloat')))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-243d7c9f050e> in <module>()
      4 from pyspark.sql.functions import expr
      5 
----> 6 DF = rawdata.select('name', expr(float('numberfloat')))

ValueError: could not convert string to float: numberfloat

raela
New Contributor III

The cast function can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast

from pyspark.sql.functions import col

df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))

AidanCondron
New Contributor II

Slightly simpler:

df_num = df.select(df.employment.cast("float"),
                   df.education.cast("float"),
                   df.health.cast("float"))

This works with multiple columns, three shown here.