CREATE view USING json and *include* _metadata, _rescued_data
11-18-2024 09:06 AM
Title may be self-explanatory. Basically, I'm curious whether it's possible (and if so, how) to add `_metadata` and `_rescued_data` fields to a view "using json".
e.g.

```sql
%sql
CREATE OR REPLACE VIEW entity_view
USING json
OPTIONS (path = "/.../.*json", multiline = true)
```
Below I show it's possible to do this with `read_files` in Spark SQL:
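Something like this works (the path is a placeholder):

```sql
%sql
SELECT
  *,        -- includes _rescued_data when the option below is set
  _metadata -- hidden column, must be selected explicitly
FROM read_files(
  '/.../*.json',
  format => 'json',
  multiLine => true,
  rescuedDataColumn => '_rescued_data'
)
```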
Below I show an error when trying to query these fields... not sure if I'm doing something wrong:
11-18-2024 09:08 AM
Forgot to add this as a reference: the JSON Files documentation doesn't seem to show any OPTIONS that would enable the functionality I'm looking for... but please feel free to correct me if there's a way to achieve this that I'm overlooking!
11-21-2024 03:03 AM
Hi @ChristianRRL, I'll be looking into this and will get back to you with an answer.
11-22-2024 02:02 AM
Hi @ChristianRRL, as a first quick look, could you please try to create a PySpark DataFrame with the `_metadata` and `_rescued_data` columns, query the DataFrame to make sure you can see those columns, and then create a view from that DataFrame?
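For example, a minimal sketch of that suggestion (the path and view name are placeholders):

```python
# Read JSON with a rescued-data column, and pull in the hidden _metadata column
df = (
    spark.read.format("json")
    .option("multiLine", "true")
    .option("rescuedDataColumn", "_rescued_data")
    .load("/.../*.json")
    .select("*", "_metadata")
)
display(df)  # confirm _metadata and _rescued_data are present

# Expose the DataFrame to SQL as a temporary view
df.createOrReplaceTempView("entity_view_tmp")
```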
11-22-2024 07:52 AM - edited 11-22-2024 07:59 AM
Hi @ChristianRRL, you can still use the same `read_files` method when creating the view. I see that you are using the classic Hive-style reader instead of `read_files` in the actual SQL view definition, and you don't need to use `spark.sql`. Please see below.
Example code:
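A sketch of what that view definition could look like (the path is a placeholder):

```sql
CREATE OR REPLACE VIEW entity_view AS
SELECT
  *,        -- includes _rescued_data when the option below is set
  _metadata -- hidden file-metadata column, selected explicitly
FROM read_files(
  '/.../*.json',
  format => 'json',
  multiLine => true,
  rescuedDataColumn => '_rescued_data'
);
```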
01-09-2025 11:26 PM
`SELECT *, _metadata.file_name FROM anytable WHERE condition`
https://docs.databricks.com/en/ingestion/file-metadata-column.html
You can use something like:

```python
df = spark.read \
    .format("json") \
    .schema(schema) \
    .load("dbfs:/tmp/*") \
    .select("*", "_metadata")
display(df)
```

You can enable the rescued data column by setting the option `rescuedDataColumn` to a column name, such as `_rescued_data`, with `spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>)`.
https://docs.databricks.com/en/query/formats/json.html#rescued-data-column
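Putting both together, a sketch (the path is a placeholder):

```python
# Enable the rescued-data column and select the hidden _metadata column in one read
df = spark.read \
    .option("rescuedDataColumn", "_rescued_data") \
    .format("json") \
    .load("dbfs:/tmp/*") \
    .select("*", "_metadata")
display(df)
```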

