11-15-2024 09:50 AM
Hi there,
I'm trying to understand if there's an easy way to create VIEWS and TABLES (I'm interested in both) *WITH* a provided schema file. For example, I understand that with DataFrames I can accomplish this with something like:
df = spark.read.schema(_schema).json(f'{dbfs_path}/{entity}.*json', multiLine=True).select('*','_metadata')
HOWEVER, I am struggling to find a way to do this exclusively via Spark SQL without needing to create a Spark DataFrame first.
Currently, I only see two approaches via Spark SQL: read_files and "USING json":
select *,_metadata
from read_files(
'dbfs:/mnt/.../entity.*json'
,format => 'json'
,schema => '...'
,schemaHints => '...'
,rescuedDataColumn => '_rescued_data'
)
;
CREATE OR REPLACE TEMPORARY VIEW entity_raw
USING json
OPTIONS (path='dbfs:/mnt/.../entity.*json', multiline=true)
;
Out of these options, "USING json" can clearly be used to create temporary views (and I presume permanent views as well), but it doesn't seem to support providing *CUSTOM* schema definitions (e.g. I have a schema file at ./schema/entity.json with explicit schema definitions).
On the other hand, the "read_files" option does support "manually entered" schema definitions... however, I don't see a way to simply point it at a schema file (e.g. ./schema/entity.json) to fully define the schema. With this option I want to avoid manually typing out long and complex schema definitions.
Can someone help me understand (A) if I'm missing something with either of these options that would let me provide a schema location rather than needing to manually write out the schemas, or (B) if there's an "easy" way to convert a schema file (entity.json) into the proper format to include with the read_files option?
Thanks in advance!
Accepted Solutions
11-15-2024 10:24 AM
Unfortunately, there isn't a built-in Spark SQL method to directly use a schema file when creating views or tables.
You could try to create a custom SQL function that reads the schema file and returns it as a string. Then use this function in your SQL commands.
- First, create a custom UDF (User Defined Function) in Scala or Python that reads the schema file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

def readSchemaFile(path: String): String = {
  // Read the schema file and return its contents as a string
  scala.io.Source.fromFile(path).getLines().mkString
}

val spark = SparkSession.builder().getOrCreate()
spark.udf.register("read_schema_file", readSchemaFile _)
- Then use this function in your SQL command:
CREATE OR REPLACE VIEW entity_view AS
SELECT *
FROM read_files(
  'dbfs:/mnt/.../entity.*json',
  format => 'json',
  schema => read_schema_file('/path/to/schema/entity.json'),
  rescuedDataColumn => '_rescued_data'
)
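Since the steps above mention Scala or Python, a rough Python equivalent of the first step might look like this (a sketch only; the function name, the StringType return registration, and the /dbfs path hint are illustrative assumptions, not tested code):
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def read_schema_file(path: str) -> str:
    # Read the schema file and return its contents as a single string
    # (for DBFS-backed files, pass the local /dbfs/... mount path)
    with open(path) as f:
        return f.read()

# Register it so it can be called from SQL as read_schema_file(...)
spark.udf.register("read_schema_file", read_schema_file, StringType())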
11-15-2024 10:49 AM - edited 11-15-2024 10:53 AM
Hi @ChristianRRL ,
(A) You're not missing anything; there's no such option as of today in the SQL API.
(B) It would be much better for you to just use PySpark, but if you have to stick to the SQL API you can use the following approach. Define your schema as a DDL string and store it in a simple text file.
Then declare a variable called schema, read the content of the schema text file into that variable, and as a last step pass the schema variable to read_files.
%sql
DECLARE OR REPLACE VARIABLE schema STRING;

SET VARIABLE schema = (
  select value
  from read_files(
    'dbfs:/FileStore/schema.txt'
    ,format => 'text'
  )
);

select *,_metadata
from read_files(
  'dbfs:/FileStore/Test.csv'
  ,format => 'csv'
  ,schema => schema
  ,rescuedDataColumn => '_rescued_data'
)
;
Here is an example of the schema file content. It can be arbitrarily complex, but for the sake of the example I only included simple data types. Storing the schema as a DDL string in a text file lets you avoid using a UDF. The drawback is that you lose some readability when the schema is defined as a DDL string.
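For instance, the schema text file could contain nothing but a single DDL string along these lines (the column names and types here are purely illustrative):
id INT, name STRING, amount DOUBLE, created_at TIMESTAMP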
11-15-2024 10:43 AM
Oh WOW. Although I'm a bit disappointed that Spark SQL doesn't have this functionality built-in, this is an excellent tip!
I haven't turned it into a UDF quite yet, but I have the proper building blocks. For example, below I can extract the raw schema StructType from my desired entity.json schema file, and I just found out I can turn it into a `simpleString()` which I will be able to use directly in the read_files VIEW!
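Roughly, the building blocks look like this (a sketch; it assumes entity.json holds a JSON-serialized StructType, and the path is illustrative):
import json
from pyspark.sql.types import StructType

# Load the JSON schema definition from the schema file (path is illustrative)
with open("/dbfs/mnt/.../schema/entity.json") as f:
    _schema = StructType.fromJson(json.load(f))

# Compact string form, e.g. struct<id:int,name:string,...>, to use with read_files
schema_string = _schema.simpleString()
print(schema_string)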
Thanks a ton!
11-15-2024 12:24 PM
This is another great tip. Thank you for sharing!
One small follow-up: I looked into this a while back but didn't have luck with it at the time. Regarding variables in Spark SQL, is there a way to initialize a variable outside of the %sql cell? For example, let's say I initialize a variable in Python, but I want to use that variable in %sql.
Is this possible?
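Something along these lines is the pattern I'm after (purely illustrative; the names are made up and the commented %sql part is what I'd like to be able to express, not working code):
# Python cell: a value computed in Python
entity_schema = "id INT, name STRING"  # e.g. built from the schema file

# ...and then, in a separate %sql cell, reference it somehow:
# %sql
# select * from read_files('dbfs:/FileStore/Test.csv', format => 'csv', schema => <entity_schema from Python?>)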

