<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to Change Schema of a Spark SQL in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-change-schema-of-a-spark-sql/m-p/28572#M20352</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and ran the following commands:&lt;/P&gt;
&lt;P&gt;df = sqlContext.sql("SELECT * FROM people_json") &lt;/P&gt;
&lt;P&gt;df.printSchema()&lt;/P&gt;
&lt;P&gt;from pyspark.sql.types import *&lt;/P&gt;
&lt;P&gt;data_schema = [StructField('age',IntegerType(),True), StructField('name',StringType(),True)] &lt;/P&gt;
&lt;P&gt;final_struc = StructType(fields=data_schema)&lt;/P&gt;
&lt;P&gt;&lt;B&gt;The tutorial says to run this command:&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;df = spark.read.json('people_json',schema=final_struc)&lt;/P&gt;
&lt;P&gt;&lt;B&gt;But this does not work. Why not, and what would work instead? Thanks!&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 15 Aug 2018 05:21:15 GMT</pubDate>
    <dc:creator>Dee</dc:creator>
    <dc:date>2018-08-15T05:21:15Z</dc:date>
    <item>
      <title>How to Change Schema of a Spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-change-schema-of-a-spark-sql/m-p/28572#M20352</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and ran the following commands:&lt;/P&gt;
&lt;P&gt;df = sqlContext.sql("SELECT * FROM people_json") &lt;/P&gt;
&lt;P&gt;df.printSchema()&lt;/P&gt;
&lt;P&gt;from pyspark.sql.types import *&lt;/P&gt;
&lt;P&gt;data_schema = [StructField('age',IntegerType(),True), StructField('name',StringType(),True)] &lt;/P&gt;
&lt;P&gt;final_struc = StructType(fields=data_schema)&lt;/P&gt;
&lt;P&gt;&lt;B&gt;The tutorial says to run this command:&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;df = spark.read.json('people_json',schema=final_struc)&lt;/P&gt;
&lt;P&gt;&lt;B&gt;But this does not work. Why not, and what would work instead? Thanks!&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Aug 2018 05:21:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-change-schema-of-a-spark-sql/m-p/28572#M20352</guid>
      <dc:creator>Dee</dc:creator>
      <dc:date>2018-08-15T05:21:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to Change Schema of a Spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-change-schema-of-a-spark-sql/m-p/28573#M20353</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The first part of your query&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;creates the &lt;CODE&gt;df&lt;/CODE&gt; DataFrame by reading an existing table.&lt;/P&gt;
&lt;P&gt;The second part uses &lt;CODE&gt;spark.read.json&lt;/CODE&gt;, which expects a file path, not a table name. For example, the following code works:&lt;/P&gt;
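&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Target schema: a nullable integer age and a nullable string name
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)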
df = spark.read.json("/my/directory/people.json", schema=final_struc)
df.show()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;with the output being:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;To change the schema of the DataFrame from your first query, you can do either of the following:&lt;/P&gt;
&lt;P&gt;1. Run a Spark SQL query that casts the columns:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df.createOrReplaceTempView("df")
df2 = spark.sql("select cast(age as int) as age, cast(name as string) as name from df")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;2. Use the PySpark DataFrame API to cast the column:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.types import IntegerType
df2 = df.withColumn("age", df["age"].cast(IntegerType()))&lt;/CODE&gt;&lt;/PRE&gt;
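&lt;P&gt;If you would rather apply the whole &lt;CODE&gt;final_struc&lt;/CODE&gt; schema in one step, one possible approach is to cast each column to the type declared in the schema. This is only a minimal sketch: the local &lt;CODE&gt;SparkSession&lt;/CODE&gt; and the small in-memory DataFrame are assumptions that stand in for your &lt;CODE&gt;people_json&lt;/CODE&gt; table, and on Databricks the existing &lt;CODE&gt;spark&lt;/CODE&gt; session would be reused.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumption: a local session for illustration; getOrCreate() returns the
# existing session when one is already running (as on Databricks).
spark = SparkSession.builder.master("local[1]").appName("schema-cast-sketch").getOrCreate()

# Hypothetical stand-in for the people_json table, with age stored as a long.
df = spark.createDataFrame(
    [(None, "Michael"), (30, "Andy"), (19, "Justin")],
    "age long, name string",
)

final_struc = StructType([
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Cast every column to the type declared for it in the target schema.
df2 = df.select([col(f.name).cast(f.dataType).alias(f.name) for f in final_struc.fields])
df2.printSchema()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This keeps the target types in one place, so adding a field to &lt;CODE&gt;final_struc&lt;/CODE&gt; is enough to have it cast as well, provided the column exists in the source DataFrame.&lt;/P&gt;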
&lt;P&gt;HTH!&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Aug 2018 06:48:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-change-schema-of-a-spark-sql/m-p/28573#M20353</guid>
      <dc:creator>dennyglee</dc:creator>
      <dc:date>2018-08-15T06:48:25Z</dc:date>
    </item>
  </channel>
</rss>

