<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Converting a transformation written in Spark Scala to PySpark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22985#M15829</link>
    <description>Re: Converting a transformation written in Spark Scala to PySpark in Data Engineering</description>
    <pubDate>Thu, 10 Nov 2022 10:43:37 GMT</pubDate>
    <dc:creator>RiyazAliM</dc:creator>
    <dc:date>2022-11-10T10:43:37Z</dc:date>
    <item>
      <title>Converting a transformation written in Spark Scala to PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22982#M15826</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I've been tasked with converting Scala Spark code to PySpark with minimal changes (essentially a literal translation).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I've come across some code that is effectively a list comprehension (a Scala for-comprehension). See the snippet below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%scala
val desiredColumn = Seq("firstName", "middleName", "lastName")
val colSize = desiredColumn.size

val columnList = for (i &amp;lt;- 0 until colSize) yield $"elements".getItem(i).alias(desiredColumn(i))

print(columnList)

// df_nameSplit.select(columnList: _*).show(false)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Output for this code snippet:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Vector(elements[0] AS firstName, elements[1] AS middleName, elements[2] AS lastName)
desiredColumn: Seq[String] = List(firstName, middleName, lastName)
colSize: Int = 3
columnList: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(elements[0] AS firstName, elements[1] AS middleName, elements[2] AS lastName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, the schema of the `df_nameSplit` data frame is as below and the elements column is a split version of the `name` column:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- elements: array (nullable = true)
 |    |-- element: string (containsNull = false)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The PySpark version of the code I was able to come up with:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;desired_columns = ["firstName", "middleName", "lastName"]

col_size = len(desired_columns)

col_list = [df_nameSplit.select(col("elements").getItem(i).alias(desired_columns[i])) for i in range(col_size)]

print(col_list)

# df_nameSplit.select(*col_list).display()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Output for PySpark code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;[DataFrame[firstName: string], DataFrame[middleName: string], DataFrame[lastName: string]]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Could someone point out where I'm going wrong?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tagging @Kaniz Fatma​&amp;nbsp;for better reach!&lt;/P&gt;</description>
      <pubDate>Wed, 09 Nov 2022 14:59:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22982#M15826</guid>
      <dc:creator>RiyazAliM</dc:creator>
      <dc:date>2022-11-09T14:59:13Z</dc:date>
    </item>
    <item>
      <title>Re: Converting a transformation written in Spark Scala to PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22983#M15827</link>
      <description>&lt;P&gt;Hi @Riyaz Ali​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;check this one:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;desired_columns = ["firstName", "middleName", "lastName"]
 
col_size = len(desired_columns)
 
col_list = [col("elements").getItem(i).alias(desired_columns[i]) for i in range(col_size)]
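# Aside: this comprehension is the direct analogue of the Scala for/yield
# above — it builds a list of Column expressions, not DataFrames. The shape
# of the result, sketched in plain Python (strings standing in for Columns):
names = ["firstName", "middleName", "lastName"]
exprs = [f"elements[{i}] AS {n}" for i, n in enumerate(names)]
# exprs == ["elements[0] AS firstName", "elements[1] AS middleName", "elements[2] AS lastName"]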
 
print(col_list)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;the output is:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;[Column&amp;lt;'elements[0] AS firstName'&amp;gt;, Column&amp;lt;'elements[1] AS middleName'&amp;gt;, Column&amp;lt;'elements[2] AS lastName'&amp;gt;]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;test:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col

schema = StructType([
    StructField("id", StringType(), True),
    StructField("elements", ArrayType(StringType()), True)
])

data = [
    ("1", ["john", "jack", "doe"])
]

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

root
 |-- id: string (nullable = true)
 |-- elements: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+-----------------+
| id|         elements|
+---+-----------------+
|  1|[john, jack, doe]|
+---+-----------------+

df.select(*col_list).display()

output:
+---------+----------+--------+
|firstName|middleName|lastName|
+---------+----------+--------+
|john     |jack      |doe     |
+---------+----------+--------+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Nov 2022 21:39:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22983#M15827</guid>
      <dc:creator>Pat</dc:creator>
      <dc:date>2022-11-09T21:39:22Z</dc:date>
    </item>
    <item>
      <title>Re: Converting a transformation written in Spark Scala to PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22984#M15828</link>
      <description>&lt;P&gt;Thank you @Pat Sienkiewicz​&amp;nbsp;!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This makes a whole lotta sense! Not sure why I was selecting from a data frame when all I needed were columns.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 09:38:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22984#M15828</guid>
      <dc:creator>RiyazAliM</dc:creator>
      <dc:date>2022-11-10T09:38:29Z</dc:date>
    </item>
    <item>
      <title>Re: Converting a transformation written in Spark Scala to PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22985#M15829</link>
      <description>&lt;P&gt;Another follow-up question, if you don't mind. @Pat Sienkiewicz​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;While trying to parse the name column into multiple columns, I came across the data below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;("James,\"A,B\", Smith", "2018",  "M", 3000)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;To parse these middle names with embedded commas, I was using the `from_csv` function.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The Scala Spark code looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%scala
// using from_csv function with defined Schema to split the columns.

val options = Map("sep" -&amp;gt; ",")

val df_split = df.select($"*", F.from_csv($"name", simpleSchema, options).alias("value_parsed"))

val df_multi_cols = df_split.select("*", "value_parsed.*").drop("value_parsed")

df.show(false)
df_multi_cols.show(false)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The schema that's mentioned above is as follows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%scala
// schema in scala
val simpleSchema = new StructType()
                    .add("firstName", StringType)
                    .add("middleName", StringType)
                    .add("lastName", StringType)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now the code I came up with for PySpark is:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Schema in PySpark
simple_schema = (StructType()
                 .add('firstName', StringType())
                 .add('middleName', StringType())
                 .add('lastName', StringType())
                )
options = {'sep':','}
df_split = df_is.select("*", from_csv(df_is.name, simple_schema, options).alias("value_parsed"))
# df_split.printSchema()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This throws an error: `TypeError: schema argument should be a column or string`&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Following the error, if I define the schema in SQL/DDL style (as a string), it works.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;options = {'sep':','}
df_split = df_is.select("*", from_csv(df_is.name, "firstName string, middleName string, lastName string", options).alias("value_parsed"))
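# Why the StructType failed: PySpark's from_csv accepts the schema only as a
# DDL string or a Column, while the Scala API also has a StructType overload.
# To keep a single programmatic definition, the DDL string can be built from
# (name, type) pairs — a plain-Python sketch, not a Spark API:
fields = [("firstName", "string"), ("middleName", "string"), ("lastName", "string")]
ddl_schema = ", ".join(f"{n} {t}" for n, t in fields)
# ddl_schema == "firstName string, middleName string, lastName string"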
df_split.printSchema()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I'm intrigued as to why this works in Scala Spark but not in PySpark. Any leads would be greatly appreciated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Riz&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 10:43:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/converting-a-transformation-written-in-spark-scala-to-pyspark/m-p/22985#M15829</guid>
      <dc:creator>RiyazAliM</dc:creator>
      <dc:date>2022-11-10T10:43:37Z</dc:date>
    </item>
  </channel>
</rss>

