Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Change Schema of a Spark SQL

Dee
New Contributor

I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and ran the following commands:

df = sqlContext.sql("SELECT * FROM people_json")

df.printSchema()

from pyspark.sql.types import *

data_schema = [StructField('age',IntegerType(),True), StructField('name',StringType(),True)]

final_struc = StructType(fields=data_schema)

### The tutorial says to run this command

df = spark.read.json('people_json',schema=final_struc)

### But this is not working. Why isn't it working, and what will? Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

dennyglee
Databricks Employee

The first part of your query

df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()

creates the df DataFrame by reading an existing table.

The second part of your query uses spark.read.json, which expects a file path rather than a table name. For example, the following code does work:

from pyspark.sql.types import *
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json("/my/directory/people.json", schema=final_struc)
df.show() 

with the output being:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

If you would like to change the schema based on your first query, you can either:

1. Execute Spark SQL to cast the columns, such as:

df.createOrReplaceTempView("df")
df2 = spark.sql("select cast(age as int) as age, cast(name as string) as name from df")

2. Use the PySpark DataFrame API to cast the columns:

from pyspark.sql.types import IntegerType
df2 = df.withColumn("age", df["age"].cast(IntegerType()))

HTH!


