The first part of your query
df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()
creates the df DataFrame by reading an existing table.
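If the people_json table is not registered yet, a minimal sketch of setting it up from the raw JSON file (the path here is hypothetical) could look like:
# Hypothetical setup: register the raw JSON file as a temp view
# so the SQL query above has a table to read from
spark.read.json("/my/directory/people.json").createOrReplaceTempView("people_json")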
The second part of your query uses
spark.read.json
which expects a file path instead. For example, the following code works:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the schema explicitly instead of letting Spark infer it
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)

df = spark.read.json("/my/directory/people.json", schema=final_struc)
df.show()
with the output being:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
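For comparison, if you let Spark infer the schema instead of supplying final_struc, the age column comes back as long rather than int, because JSON whole numbers are inferred as LongType:
df_inferred = spark.read.json("/my/directory/people.json")
df_inferred.printSchema()
with the output being:
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)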
If you would like to change the schema of the DataFrame returned by your first query, you can:
1. Execute Spark SQL such as
df.createOrReplaceTempView("df")
df2 = spark.sql("select cast(age as int) as age, cast(name as string) as name from df")
2. Use the PySpark DataFrame API to cast the columns
from pyspark.sql.types import IntegerType
df2 = df.withColumn("age", df["age"].cast(IntegerType()))
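Either way, df2.printSchema() now reports age as an integer:
df2.printSchema()
with the output being:
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)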
HTH!