11-15-2024 09:50 AM
Hi there,
I'm trying to understand if there's an easy way to create VIEWS and TABLES (I'm interested in both) *WITH* a provided schema file. For example, I understand that via dataframes I can accomplish this via something like this:
df = spark.read.schema(_schema).json(f'{dbfs_path}/{entity}.*json', multiLine=True).select('*','_metadata')
HOWEVER, I am struggling to find a way to do this exclusively via Spark SQL without needing to create a spark dataframe first.
Currently, I only see two approaches via Spark SQL: read_files and "USING json":
select *,_metadata
from read_files(
'dbfs:/mnt/.../entity.*json'
,format => 'json'
,schema => '...'
,schemaHints => '...'
,rescuedDataColumn => '_rescued_data'
)
;
CREATE OR REPLACE TEMPORARY VIEW entity_raw
USING json
OPTIONS (path='dbfs:/mnt/.../entity.*json', multiline=true)
;
Out of these options, "USING json" can clearly be used to create temporary views (and I presume permanent views as well), but it doesn't seem to support providing *CUSTOM* schema definitions (e.g. I have a schema file at ./schema/entity.json with explicit schema definitions).
On the other hand, the "read_files" option does support "manually entered" schema definitions... however, I don't see a way to simply point it at a schema file (e.g. ./schema/entity.json) to fully define the schema. In other words, with this option I want to avoid manually typing out long and complex schema definitions.
Can someone help me understand if (A) I'm missing something with either of these options to provide a schemaLocation rather than needing to manually write out the schemas, or (B) if there's an "easy" way to convert a schema file (entity.json) into the proper format to include it with the read_files option.
Thanks in advance!
11-15-2024 10:24 AM
Unfortunately, there isn't a built-in Spark SQL method to directly use a schema file when creating views or tables.
You could try to create a custom SQL function that reads the schema file and returns it as a string. Then use this function in your SQL commands.
import org.apache.spark.sql.SparkSession

// Read the schema file and return its contents as a single-line string
def readSchemaFile(path: String): String = {
  val source = scala.io.Source.fromFile(path)
  try source.getLines().mkString(" ") // join lines with a space so multi-line schemas stay valid
  finally source.close()
}

val spark = SparkSession.builder().getOrCreate()

// Register the function so it can be called from SQL as read_schema_file(...)
spark.udf.register("read_schema_file", readSchemaFile _)
CREATE OR REPLACE VIEW entity_view AS
SELECT *
FROM read_files(
  'dbfs:/mnt/.../entity.*json',
  format => 'json',
  schema => read_schema_file('/path/to/schema/entity.json'),
  rescuedDataColumn => '_rescued_data'
);
11-15-2024 10:43 AM
Oh WOW. Although I'm a bit disappointed that Spark SQL doesn't have this functionality built-in, this is an excellent tip!
I haven't turned it into a udf quite yet, but I have the proper building blocks. For example, below I can extract the raw schema StructType from my desired entity.json schema file, and I just found out I can turn it into a `simpleString()` which I will be able to use directly in the read_files VIEW!
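Roughly, this is the shape of it (the paths and variable names here are just from my own setup, and I'm assuming the schema file holds the JSON produced by StructType.json()):
import json
from pyspark.sql.types import StructType

# Read the raw schema file (a StructType serialized as JSON) into one string
schema_text = ''.join(
    row.value for row in
    spark.read.text(f'{dbfs_path}/schema/{entity}.json').collect()
)

# Rebuild the StructType and grab its string form to feed into read_files
_schema = StructType.fromJson(json.loads(schema_text))
print(_schema.simpleString())  # e.g. struct<id:bigint,name:string,...>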
Thanks a ton!
11-15-2024 10:49 AM - edited 11-15-2024 10:53 AM
Hi @ChristianRRL ,
(A) You're not missing anything; there's no such option as of today in the SQL API.
(B) It would be much better for you to just use pyspark, but if you have to stick to the SQL API you can use the following approach. Define your schema as a DDL string and store it in a simple text file.
Then you can declare a variable called schema, read the content of the schema text file into that variable, and as a last step pass the schema variable to read_files.
%sql
DECLARE OR REPLACE VARIABLE schema STRING;

-- Load the DDL string from the schema text file into the variable
SET VARIABLE schema = (
  select value
  from read_files(
    'dbfs:/FileStore/schema.txt'
    ,format => 'text'
  )
);

-- Use the variable as the schema for the actual read
select *, _metadata
from read_files(
  'dbfs:/FileStore/Test.csv'
  ,format => 'csv'
  ,schema => schema
  ,rescuedDataColumn => '_rescued_data'
)
;
Here is an example of the schema file content. It can be arbitrarily complex, but for the sake of the example I only used simple data types. Storing it as a DDL string in a text file allows you to avoid using a udf. The drawback is that you lose some readability when the schema is defined as a DDL string.
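For instance, a simple schema.txt (column names made up purely for illustration) could contain:
id INT, name STRING, amount DOUBLE, created_at TIMESTAMP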
11-15-2024 12:24 PM
This is another great tip. Thank you for sharing!
One small follow-up: I looked into this a while back but didn't have luck with it at the time. Regarding variables in Spark SQL, is there a way to initialize a variable outside of the %sql cell? For example, let's say I initialize a variable in Python, but I want to use that variable in %sql.
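Roughly what I'm after (pseudo-code only, nothing I've actually gotten working, and the names are made up):
# Python cell: build the schema string in Python
entity_schema = "id INT, name STRING"  # e.g. derived from a schema file as above

# Then, in a separate %sql cell, somehow reference it:
# %sql
# DECLARE OR REPLACE VARIABLE schema STRING;
# SET VARIABLE schema = <entity_schema from the Python cell>;  -- is something like this doable?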
Is this possible?