CREATE view USING json and *include* _metadata, _rescued_data

ChristianRRL
Honored Contributor

Title may be self-explanatory. Basically, I'm curious to ask if it's possible (and if so how) to add `_metadata` and `_rescued_data` fields to a view "using json".

e.g.

 

%sql

CREATE OR REPLACE VIEW entity_view
USING json
OPTIONS (path="/.../.*json",multiline=true)

 

Below I show it's possible to do this with `read_files` in Spark SQL:

ChristianRRL_0-1731949214474.png

Below I show an error trying to query these fields.. not sure if I'm doing something wrong:

ChristianRRL_1-1731949348303.png

 

 

ChristianRRL
Honored Contributor

Forgot to add this as reference:

The JSON Files documentation doesn't seem to show any OPTIONS that may enable the functionality that I'm looking for.. But please feel free to correct me if there's a way to achieve what I'm looking for that I may be overlooking!

Nam_Nguyen
Databricks Employee
Databricks Employee

Hi @ChristianRRL , I'll be looking into this, and I'll get back to you with an answer

Nam_Nguyen
Databricks Employee
Databricks Employee

Hi @ChristianRRL , as a first quick look, could you please try to create a PySpark dataframe with the _metadata and _rescued_data columns, query the dataframe to make sure you can see those columns, and then create a view using this dataframe?

akhil393
Databricks Employee
Databricks Employee

Hi @ChristianRRL You can still use the same method read_files when creating the view, I see that you are using classic hive style reader instead of using the read_files in the actual view definition of sql and you don't need to use spark.sql, please see below. 

akhil393_0-1732290590854.jpeg

Example code: 

create or replace view json_view
as
SELECT _metadata, *
FROM read_files(
's3://********/_delta_log/*.json',
format => 'json')

 

NandiniN
Databricks Employee
Databricks Employee
I am able to perform the below operation for a delta table.
SELECT *,_metadata.file_name FROM anytable where condition.

https://docs.databricks.com/en/ingestion/file-metadata-column.html

You can use something like 

df = spark.read \
  .format("json") \
  .schema(schema) \
  .load("dbfs:/tmp/*") \
  .select("*", "_metadata")

display(df)

 

. You can enable the rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>).

https://docs.databricks.com/en/query/formats/json.html#rescued-data-column