Another follow-up question, if you don't mind. @Pat Sienkiewicz
While trying to parse the name column into multiple columns, I came across the data below:
("James,\"A,B\", Smith", "2018", "M", 3000)
To parse these middle names that contain commas, I used the `from_csv` function.
The Scala Spark code looks like this:
%scala
// Use the from_csv function with a defined schema to split the name column.
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val options = Map("sep" -> ",")
val df_split = df.select($"*", F.from_csv($"name", simpleSchema, options).alias("value_parsed"))
val df_multi_cols = df_split.select("*", "value_parsed.*").drop("value_parsed")
df.show(false)
df_multi_cols.show(false)
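For what it's worth, here is what I expect the three parsed columns to look like for the sample row (a sketch of the expected values; whether lastName keeps its leading space depends on CSV options such as ignoreLeadingWhiteSpace, which I have not set):
+---------+----------+--------+
|firstName|middleName|lastName|
+---------+----------+--------+
|James    |A,B       | Smith  |
+---------+----------+--------+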
The schema mentioned above is defined as follows:
%scala
// Schema definition in Scala.
import org.apache.spark.sql.types.{StringType, StructType}

val simpleSchema = new StructType()
  .add("firstName", StringType)
  .add("middleName", StringType)
  .add("lastName", StringType)
Now, the code that I came up with for PySpark is this:
# Schema in PySpark
from pyspark.sql.functions import from_csv
from pyspark.sql.types import StringType, StructType

simple_schema = (StructType()
    .add('firstName', StringType())
    .add('middleName', StringType())
    .add('lastName', StringType())
)
options = {'sep': ','}
df_split = df_is.select("*", from_csv(df_is.name, simple_schema, options).alias("value_parsed"))
#df_split.printSchema()
This throws an error: `TypeError: schema argument should be a column or string`
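Since the error says a column is also accepted, I tried passing one via `schema_of_csv` (a sketch; `sample` is just a hypothetical literal I built from the data above). It runs, but the inferred field names come out as _c0, _c1, _c2 rather than the names I want:
from pyspark.sql.functions import from_csv, lit, schema_of_csv

# schema_of_csv infers a schema from a sample CSV string; the field
# names are generated (_c0, _c1, _c2), not taken from my StructType.
sample = 'James,"A,B", Smith'
df_probe = df_is.select(from_csv(df_is.name, schema_of_csv(lit(sample)), options).alias("value_parsed"))
df_probe.printSchema()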
If I instead define the schema as a DDL-style string (in quotes), it works:
options = {'sep':','}
df_split = df_is.select("*", from_csv(df_is.name, "firstName string, middleName string, lastName string", options).alias("value_parsed"))
df_split.printSchema()
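In case it helps, I can at least avoid defining the schema twice by deriving the DDL string from the StructType itself (a sketch; it assumes simple field types whose simpleString() output is valid DDL, which holds for string here):
# Build the DDL string from the existing StructType so the schema
# is only maintained in one place; simpleString() yields 'string' etc.
ddl_schema = ", ".join(
    f"{field.name} {field.dataType.simpleString()}" for field in simple_schema.fields
)
# ddl_schema == 'firstName string, middleName string, lastName string'
df_split = df_is.select("*", from_csv(df_is.name, ddl_schema, options).alias("value_parsed"))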
I'm intrigued as to why this works in Scala Spark but not in PySpark. Any leads would be greatly appreciated.
Best,
Riz