How to Change the Schema of a Spark SQL DataFrame

Dee
New Contributor

I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and wrote the commands as follows:

df = sqlContext.sql("SELECT * FROM people_json")

df.printSchema()

from pyspark.sql.types import *

data_schema = [StructField('age',IntegerType(),True), StructField('name',StringType(),True)]

final_struc = StructType(fields=data_schema)

### The tutorial says to run this command

df = spark.read.json('people_json',schema=final_struc)

### But this is not working. Why is this not working, and what will work? Thanks!

1 ACCEPTED SOLUTION

dennyglee
New Contributor III

The first part of your query,

df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()

creates the df DataFrame by reading an existing table.
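
For reference, spark.table does the same thing more directly; a minimal equivalent, using the same table name:

df = spark.table("people_json")  # same as sqlContext.sql("SELECT * FROM people_json")
df.printSchema()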

The second part of your query uses spark.read.json, which expects a path to a JSON file rather than a table name. For example, the following code does work:

from pyspark.sql.types import *
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json("/my/directory/people.json", schema=final_struc)
df.show() 

with the output being:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
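
One note on the path: /my/directory/people.json above is just an example location. If you uploaded the file through the Databricks UI, it typically lands under /FileStore/tables/, so the read might look like this (the exact filename is an assumption; check where your upload actually landed):

# Hypothetical path: adjust to wherever your uploaded file lives in DBFS
df = spark.read.json("/FileStore/tables/people_json.json", schema=final_struc)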

If you would like to change the schema of the table from your first query, you can do either of the following:

1. Execute Spark SQL to cast the columns, such as

df.createOrReplaceTempView("df")
df2 = spark.sql("select cast(age as int) as age, cast(name as string) as name from df")

2. Use the PySpark DataFrame API to cast the columns

from pyspark.sql.types import IntegerType
df2 = df.withColumn("age", df["age"].cast(IntegerType()))
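
Either way, a quick printSchema() confirms the cast took effect (expected output sketched below, assuming the sample people.json data):

df2.printSchema()
# root
#  |-- age: integer (nullable = true)
#  |-- name: string (nullable = true)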

HTH!


