Databricks

SohelKhan · ‎02-22-2016

Pyspark 1.6: DataFrame: Converting one column from string to float/double

I have two columns in a dataframe both of which are loaded as string.

DF = rawdata.select('house name', 'price')

I want to convert DF.price to float.

DF = rawdata.select('house name', float('price')) #did not work

DF[DF.price = float(DF.price)) # did not work

DF.price = DF.price.astype(float) # pandas-like syntax did not work

Could you please help me convert it within the DataFrame?

I know how to convert it in the RDD: DF.map(lambda x: float(x.price))

However, I am trying to do all of the conversion in the DataFrame.

Note: My platform does not have the same interface as the Databricks platform, where you can change the column type while loading the file.

raela · ‎03-11-2016

The cast function can convert the specified columns into different data types. You shouldn't need a UDF to do this. If rawdata is a DataFrame, this should work:

from pyspark.sql.functions import col

df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast


zjffdu · ‎02-24-2016

You can use a UDF to do that. Unfortunately, there's no built-in for this type conversion.

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('house name', expr(float('price')))

SohelKhan · ‎02-27-2016

I fixed it as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, StringType())

rawdata.withColumn("name", udfstring_to_float("numberfloat"))

Out[8]: DataFrame[name: string, number_int: int, numberfloat: double]

SohelKhan · ‎02-28-2016

Thanks for the suggestion. Sorry though, it did not work.

from pyspark.sql.functions import udf

sqlContext.udf.register("float",lambda x:float(x))

from pyspark.sql.functions import expr

DF = rawdata.select('name', expr(float('numberfloat')))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-243d7c9f050e> in <module>()
      4 from pyspark.sql.functions import expr
      5
----> 6 DF = rawdata.select('name', expr(float('numberfloat')))

ValueError: could not convert string to float: numberfloat
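The ValueError here is raised by Python itself, not by Spark: float('numberfloat') is evaluated first, on the literal text of the column name, before expr() receives anything. The small sketch below reproduces that in plain Python; describe_conversion is a hypothetical helper written only for this illustration.

```python
def describe_conversion(text):
    """Call float() the way plain Python does and report the outcome."""
    try:
        return float(text)
    except ValueError as err:
        return "ValueError: {}".format(err)

# A column name is just text to Python, so float() rejects it.
print(describe_conversion("numberfloat"))

# Numeric text converts fine, which is all float() can handle.
print(describe_conversion("3.5"))
```

For Spark to do the conversion, the whole expression would have to be passed as a string, for example expr("cast(numberfloat as float)"), so that Spark parses it rather than Python.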


AidanCondron · ‎01-11-2017

Slightly simpler:

df_num = df.select(df.employment.cast("float"),
                   df.education.cast("float"),
                   df.health.cast("float"))

This works with multiple columns; three are shown here.
