Data Engineering

Pyspark DataFrame: Converting one column from string to float/double

SohelKhan
New Contributor II

Pyspark 1.6: DataFrame: Converting one column from string to float/double

I have two columns in a dataframe both of which are loaded as string.

DF = rawdata.select('house name', 'price')

I want to convert DF.price to float.

DF = rawdata.select('house name', float('price')) #did not work

DF[DF.price = float(DF.price)) # did not work

DF.price = DF.price.astype(float) # Panda like script did not work

Could you please help me convert it within the DataFrame?

I know how to convert it in the RDD: DF.map(lambda x: float(x.price))

But, I am trying to do all the conversion in the Dataframe.

Note: My platform does not have the same interface as the Databricks platform, in which you can change the column type while loading the file.

Accepted Solution

raela
New Contributor III

The cast function can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast

from pyspark.sql.functions import col

df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))
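
If the goal is to keep every other column as it is, a withColumn-based variant may be handy. The sketch below is only an illustration: cast_columns is a hypothetical helper name, not part of PySpark, and it assumes df is a DataFrame whose listed columns hold numeric text.

```python
def cast_columns(df, names, type_name="double"):
    """Return df with each column in `names` cast to `type_name`.

    `cast_columns` is a hypothetical helper, not a PySpark API. It relies only
    on DataFrame.withColumn, DataFrame.__getitem__ and Column.cast.
    """
    for name in names:
        # df[name] also works for names with spaces, such as 'house name'.
        # Reusing the same name replaces the column instead of adding a new one.
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

For the question above this would be called as DF = cast_columns(rawdata, ['price']), and DF.printSchema() should then report price as double. In Spark 1.6, values that cannot be parsed as numbers should come back as null rather than raising an error.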


5 Replies

zjffdu
New Contributor II

You can use a UDF to do that. Unfortunately, there's no builtin for this type conversion.

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('house name', expr(float('price')))

SohelKhan
New Contributor II

I fixed it as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def string_to_float(x):
  return float(x)

udfstring_to_float = udf(string_to_float, StringType())
rawdata.withColumn("name",udfstring_to_float("numberfloat") )

Out[8]: DataFrame[name: string, number_int: int, numberfloat: double]
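
One thing to watch in the snippet above: the UDF is declared with StringType(), and the schema output shows the new column as string, so the result is probably not numeric. A minimal sketch of a UDF that declares a numeric return type follows. The names parse_float and add_float_column are hypothetical, chosen for illustration, and a plain cast is usually simpler than a UDF.

```python
def parse_float(value):
    """Return value as a Python float, or None if it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def add_float_column(df, source, target):
    """Add (or replace) column `target` holding `source` parsed as a double.

    Hypothetical helper; assumes `df` is a PySpark DataFrame.
    """
    # Imported here so parse_float stays usable without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # The declared return type decides the column type in the schema.
    to_double = udf(parse_float, DoubleType())
    return df.withColumn(target, to_double(df[source]))
```

With the asker's data this might be called as add_float_column(rawdata, "price", "price"); rows whose text cannot be parsed would end up as null.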

SohelKhan
New Contributor II

Thanks for the suggestion. Sorry though, it did not work.

from pyspark.sql.functions import udf

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('name', expr(float('numberfloat')))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-243d7c9f050e> in <module>()
      4 from pyspark.sql.functions import expr
      5
----> 6 DF = rawdata.select('name', expr(float('numberfloat')))

ValueError: could not convert string to float: numberfloat
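
The traceback suggests the error comes from Python itself rather than from Spark: float('numberfloat') is evaluated first and tries to parse the literal text numberfloat, so expr never receives anything. expr expects a SQL expression written as a string. A rough sketch, where cast_expr is a hypothetical helper that only builds such a string:

```python
def cast_expr(column, type_name="float"):
    """Build a Spark SQL expression string that casts `column` to `type_name`.

    Hypothetical helper; backticks quote the column name so that names with
    spaces (for example 'house name') remain a single identifier.
    """
    return "CAST(`{0}` AS {1}) AS `{0}`".format(column, type_name)


# Python's built-in float() parses the text it is handed, not a column,
# which is where the ValueError in the traceback originates.
try:
    float("numberfloat")
except ValueError as err:
    message = str(err)
```

Assuming rawdata has the columns shown earlier, the resulting string could be passed on as DF = rawdata.selectExpr('name', cast_expr('numberfloat')), or wrapped in expr(...) inside select.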


AidanCondron
New Contributor II

Slightly simpler:

df_num = df.select(df.employment.cast("float"),
                   df.education.cast("float"),
                   df.health.cast("float"))

This works with multiple columns, three shown here.
