Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark SQL: USING JSON to create VIEWS/TABLES with existing schema file

ChristianRRL
Valued Contributor

Hi there,

I'm trying to understand if there's an easy way to create VIEWS and TABLES (I'm interested in both) *WITH* a provided schema file. For example, I understand that with DataFrames I can accomplish this with something like:

df = spark.read.schema(_schema).json(f'{dbfs_path}/{entity}.*json', multiLine=True).select('*','_metadata')

HOWEVER, I am struggling to find a way to do this exclusively via Spark SQL without needing to create a spark dataframe first.

Currently, I only see two approaches via Spark SQL: read_files and "USING json":

select *,_metadata
from read_files(
  'dbfs:/mnt/.../entity.*json'
  ,format => 'json'
  ,schema => '...'
  ,schemaHints => '...'
  ,rescuedDataColumn => '_rescued_data'
  )
;
CREATE OR REPLACE TEMPORARY VIEW entity_raw
USING json
OPTIONS (path='dbfs:/mnt/.../entity.*json', multiline=true)
;

Out of these options, "USING json" can clearly be used to create temporary views (and I presume permanent views as well), but it doesn't seem to support providing *CUSTOM* schema definitions (e.g. I have a schema file at ./schema/entity.json with explicit schema definitions).

On the other hand, the "read_files" option does support "manually entered" schema definitions... however, I don't see a way to simply point it at a schema file (e.g. ./schema/entity.json) to fully define the schema. In short, with this option I want to avoid manually entering long and complex schema definitions.

Can someone help me understand whether (A) I'm missing something with either of these options to provide a schemaLocation rather than needing to manually write out the schemas, or (B) there's an "easy" way to convert a schema file (entity.json) into the proper format to include it with the read_files option.

Thanks in advance!

4 REPLIES

Walter_C
Databricks Employee

Unfortunately, there isn't a built-in Spark SQL method to directly use a schema file when creating views or tables.

You could try to create a custom SQL function that reads the schema file and returns it as a string. Then use this function in your SQL commands.

  1. First, create a custom UDF (User Defined Function) in Scala or Python that reads the schema file (a Python sketch of the equivalent registration follows these steps). In Scala:

    import org.apache.spark.sql.SparkSession
    
    // Read the schema file and return its contents as a single string
    def readSchemaFile(path: String): String = {
      val source = scala.io.Source.fromFile(path)
      try source.getLines().mkString("\n") finally source.close()
    }
    
    val spark = SparkSession.builder().getOrCreate()
    spark.udf.register("read_schema_file", readSchemaFile _)


  2. Then use this function in your SQL command:
    CREATE OR REPLACE VIEW entity_view AS
    SELECT *
    FROM read_files(
      'dbfs:/mnt/.../entity.*json',
      format => 'json',
      schema => read_schema_file('/path/to/schema/entity.json'),
      rescuedDataColumn => '_rescued_data'
    )
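
For the Python route mentioned in step 1, here is a minimal hedged sketch of the equivalent registration (the path handling is an assumption, and as with the Scala version it's worth verifying that read_files on your runtime accepts a function call in the schema argument):

    from pyspark.sql.types import StringType
    
    def read_schema_file(path: str) -> str:
        # Read the schema file (a DDL string or JSON schema text) from a
        # location visible to the cluster, e.g. a /dbfs/... path
        with open(path) as f:
            return f.read()
    
    # 'spark' is the SparkSession that Databricks notebooks provide by default
    spark.udf.register("read_schema_file", read_schema_file, StringType())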

 

Oh WOW. Although I'm a bit disappointed that Spark SQL doesn't have this functionality built-in, this is an excellent tip!

I haven't turned it into a UDF quite yet, but I have the proper building blocks. For example, below I can extract the raw schema StructType from my desired entity.json schema file. And I just found out I can turn it into a `simpleString()`, which I will be able to use directly in the read_files VIEW!

[Screenshot: notebook cell loading the entity.json schema file into a StructType and converting it with simpleString()]
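
A hedged Python sketch of those building blocks (the path is a placeholder, and it assumes entity.json stores the schema in Spark's JSON schema format, i.e. something produced by df.schema.json()):

    import json
    from pyspark.sql.types import StructType
    
    # Load the schema file and rebuild the StructType from its JSON representation
    with open('/dbfs/mnt/.../schema/entity.json') as f:
        struct = StructType.fromJson(json.load(f))
    
    # simpleString() produces a compact DDL-like form, e.g. struct<id:int,name:string>,
    # which can then be pasted or interpolated into read_files(..., schema => ...)
    print(struct.simpleString())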

Thanks a ton!

 

szymon_dybczak
Contributor III

Hi @ChristianRRL ,

(A) You're not missing anything; there's no such option as of today for the SQL API.
(B) It would be much better for you to just use PySpark, but if you have to stick to the SQL API you can use the following approach. Define your schema as a DDL string and store it in a simple text file.
Then declare a variable called schema, read the content of the schema text file into that variable, and as a last step pass the schema variable to read_files.

 

 

%sql
DECLARE OR REPLACE VARIABLE schema STRING;

-- Read the DDL string from the schema file into the session variable
-- (assumes schema.txt holds the whole DDL string on a single line)
SET VAR schema = (
  select value
  from read_files(
    'dbfs:/FileStore/schema.txt'
    ,format => 'text'
    )
  );

select *,_metadata
from read_files(
  'dbfs:/FileStore/Test.csv'
  ,format => 'csv'
  ,schema => schema
  ,rescuedDataColumn => '_rescued_data'
  )
;

 

 

Here is an example of the content of the schema file. It can be arbitrarily complex, but for the sake of the example I only used simple data types. Storing it as a DDL string in a text file lets you avoid using a UDF. The drawback is that you lose some readability when the schema is defined as a DDL string.

[Screenshot: example schema.txt content, a DDL string using simple data types]
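
For illustration only (this is not the actual screenshot content), a schema.txt along these lines would serve as a DDL string:

    id INT, name STRING, amount DOUBLE, created_at TIMESTAMP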

 

This is another great tip. Thank you for sharing!

One small follow-up: I looked into this a while back but didn't have luck with it at the time. Regarding variables in Spark SQL, is there a way to initialize a variable outside of the %sql cell? For example, let's say I initialize a variable in Python, but I want to use that variable in %sql.

Is this possible?
