Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

What is `read_files`?

ChristianRRL
Valued Contributor III

Bit of a silly question, but wondering if someone can help me better understand what is `read_files`?

There are at least three ways to pull raw JSON data into a Spark DataFrame:

  • df = spark.read...
  • df = spark.readStream... (i.e. AutoLoader)
  • select * from read_files(...)

I'm curious: is read_files a Databricks SQL-specific function, or is it native to Spark? In particular, I'm curious about the `schemaHints` functionality that both Auto Loader and read_files support, but spark.read seemingly does not (as far as I can tell).

1 ACCEPTED SOLUTION

Accepted Solutions

Isi
Honored Contributor III

Hello @ChristianRRL ,

No, read_files is not a native Spark function. It's a Databricks SQL function that allows you to read files easily using SQL syntax.

The main advantage is that it adds several Databricks-specific capabilities on top of Spark's basic file reader, such as schema inference, schema hints, rescued data handling, and partition discovery.

For example:

SELECT * FROM read_files(
  's3://my-bucket/path/',
  format => 'json',
  schemaHints => 'user_id STRING, event_time TIMESTAMP'
);

is roughly equivalent to:

spark.read.format("json").load("s3://my-bucket/path/")

but with extra Databricks logic for schema management and ingestion governance.


Regarding schemaHints, it works the same way as in Auto Loader: it lets you override or enforce specific column types while leaving the rest of the schema inferred automatically. Docs
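For reference, the Auto Loader counterpart is the `cloudFiles.schemaHints` option. A minimal sketch, assuming a Databricks cluster; the bucket path and column names are hypothetical examples:

```python
# Auto Loader sketch: schemaHints pins the types of the named columns
# while the rest of the schema is still inferred automatically.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")
    .option("cloudFiles.schemaHints", "user_id STRING, event_time TIMESTAMP")
    .load("s3://my-bucket/path/")
)
```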

While spark.read in open-source Spark only lets you either fully define a schema or infer it entirely (Docs), Databricks added schemaHints to this built-in function in its Databricks Runtime, so you can override or enforce specific column types while letting the rest of the schema be inferred automatically.
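For contrast, here is a plain spark.read sketch where the schema has to be supplied in full up front (column names reuse the earlier example and are assumptions):

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Open-source Spark: either infer the whole schema or define all of it.
# There is no per-column hint mechanism comparable to schemaHints.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])
df = spark.read.format("json").schema(schema).load("s3://my-bucket/path/")
```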

Hope this helps 🙂

Isi

View solution in original post

2 REPLIES 2


BS_THE_ANALYST
Esteemed Contributor II

Also, @ChristianRRL, with a slight adjustment to the syntax, it does indeed behave like Auto Loader:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/patterns?language=SQL 

BS_THE_ANALYST_0-1759901883674.png
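The pattern in the linked docs adds the STREAM keyword so read_files runs as an incremental, Auto Loader-style ingestion. A sketch along those lines; the table name and path are assumptions:

```sql
-- STREAM makes read_files incremental: new files are picked up as they
-- arrive, like Auto Loader. Names below are illustrative only.
CREATE OR REFRESH STREAMING TABLE raw_events AS
SELECT * FROM STREAM read_files(
  's3://my-bucket/path/',
  format => 'json',
  schemaHints => 'user_id STRING, event_time TIMESTAMP'
);
```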

I'd also advise looking at the different options that Autoloader has when working with cloud storage i.e. Directory Listing Mode and File notification mode (recommended): https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes 
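As a sketch of the file notification mode mentioned above (the `cloudFiles.useNotifications` option comes from the linked docs; paths are hypothetical):

```python
# File notification mode: Auto Loader subscribes to cloud storage events
# instead of listing directories, which scales better for large buckets.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")
    .load("s3://my-bucket/path/")
)
```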

All the best,
BS
